gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
/* Handle ROCm Code Objects for GDB, the GNU Debugger.
|
|
|
|
|
2024-01-12 23:30:44 +08:00
|
|
|
Copyright (C) 2019-2024 Free Software Foundation, Inc.
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
This file is part of GDB.
|
|
|
|
|
|
|
|
This program is free software; you can redistribute it and/or modify
|
|
|
|
it under the terms of the GNU General Public License as published by
|
|
|
|
the Free Software Foundation; either version 3 of the License, or
|
|
|
|
(at your option) any later version.
|
|
|
|
|
|
|
|
This program is distributed in the hope that it will be useful,
|
|
|
|
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
GNU General Public License for more details.
|
|
|
|
|
|
|
|
You should have received a copy of the GNU General Public License
|
|
|
|
along with this program. If not, see <http://www.gnu.org/licenses/>. */
|
|
|
|
|
|
|
|
|
|
|
|
#include "amd-dbgapi-target.h"
|
|
|
|
#include "amdgpu-tdep.h"
|
|
|
|
#include "arch-utils.h"
|
|
|
|
#include "elf-bfd.h"
|
|
|
|
#include "elf/amdgpu.h"
|
2024-04-23 21:22:59 +08:00
|
|
|
#include "event-top.h"
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
#include "gdbsupport/fileio.h"
|
|
|
|
#include "inferior.h"
|
|
|
|
#include "observable.h"
|
|
|
|
#include "solib.h"
|
|
|
|
#include "solib-svr4.h"
|
|
|
|
#include "solist.h"
|
|
|
|
#include "symfile.h"
|
|
|
|
|
2023-07-20 18:15:50 +08:00
|
|
|
#include <unordered_map>
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
/* Per inferior cache of opened file descriptors. */
|
|
|
|
struct rocm_solib_fd_cache
|
|
|
|
{
|
|
|
|
explicit rocm_solib_fd_cache (inferior *inf) : m_inferior (inf) {}
|
|
|
|
DISABLE_COPY_AND_ASSIGN (rocm_solib_fd_cache);
|
|
|
|
|
|
|
|
/* Return a read-only file descriptor to FILENAME and increment the
|
|
|
|
associated reference count.
|
|
|
|
|
|
|
|
Open the file FILENAME if it is not already opened, reuse the existing file
|
|
|
|
descriptor otherwise.
|
|
|
|
|
|
|
|
On error -1 is returned, and TARGET_ERRNO is set. */
|
|
|
|
int open (const std::string &filename, fileio_error *target_errno);
|
|
|
|
|
|
|
|
/* Decrement the reference count to FD and close FD if the reference count
|
|
|
|
reaches 0.
|
|
|
|
|
|
|
|
On success, return 0. On error, return -1 and set TARGET_ERRNO. */
|
|
|
|
int close (int fd, fileio_error *target_errno);
|
|
|
|
|
|
|
|
private:
|
|
|
|
struct refcnt_fd
|
|
|
|
{
|
|
|
|
DISABLE_COPY_AND_ASSIGN (refcnt_fd);
|
|
|
|
refcnt_fd (int fd, int refcnt) : fd (fd), refcnt (refcnt) {}
|
|
|
|
|
|
|
|
int fd = -1;
|
|
|
|
int refcnt = 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
inferior *m_inferior;
|
|
|
|
std::unordered_map<std::string, refcnt_fd> m_cache;
|
|
|
|
};
|
|
|
|
|
|
|
|
int
|
|
|
|
rocm_solib_fd_cache::open (const std::string &filename,
|
|
|
|
fileio_error *target_errno)
|
|
|
|
{
|
|
|
|
auto it = m_cache.find (filename);
|
|
|
|
if (it == m_cache.end ())
|
|
|
|
{
|
|
|
|
/* The file is not yet opened on the target. */
|
|
|
|
int fd
|
|
|
|
= target_fileio_open (m_inferior, filename.c_str (), FILEIO_O_RDONLY,
|
|
|
|
false, 0, target_errno);
|
|
|
|
if (fd != -1)
|
|
|
|
m_cache.emplace (std::piecewise_construct,
|
|
|
|
std::forward_as_tuple (filename),
|
|
|
|
std::forward_as_tuple (fd, 1));
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* The file is already opened. Increment the refcnt and return the
|
|
|
|
already opened FD. */
|
|
|
|
it->second.refcnt++;
|
|
|
|
gdb_assert (it->second.fd != -1);
|
|
|
|
return it->second.fd;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
rocm_solib_fd_cache::close (int fd, fileio_error *target_errno)
|
|
|
|
{
|
|
|
|
using cache_val = std::unordered_map<std::string, refcnt_fd>::value_type;
|
|
|
|
auto it
|
|
|
|
= std::find_if (m_cache.begin (), m_cache.end (),
|
|
|
|
[fd](const cache_val &s) { return s.second.fd == fd; });
|
|
|
|
|
|
|
|
gdb_assert (it != m_cache.end ());
|
|
|
|
|
|
|
|
it->second.refcnt--;
|
|
|
|
if (it->second.refcnt == 0)
|
|
|
|
{
|
|
|
|
int ret = target_fileio_close (it->second.fd, target_errno);
|
|
|
|
m_cache.erase (it);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Keep the FD open for the other users, return success. */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
} /* Anonymous namespace. */
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
/* ROCm-specific inferior data. */
|
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
struct rocm_so
|
|
|
|
{
|
|
|
|
rocm_so (const char *name, std::string unique_name, lm_info_svr4_up lm_info)
|
|
|
|
: name (name),
|
|
|
|
unique_name (std::move (unique_name)),
|
|
|
|
lm_info (std::move (lm_info))
|
|
|
|
{}
|
|
|
|
|
|
|
|
std::string name, unique_name;
|
|
|
|
lm_info_svr4_up lm_info;
|
|
|
|
};
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
struct solib_info
|
|
|
|
{
|
2023-07-20 18:15:50 +08:00
|
|
|
explicit solib_info (inferior *inf)
|
2023-10-06 04:43:20 +08:00
|
|
|
: fd_cache (inf)
|
2023-07-20 18:15:50 +08:00
|
|
|
{};
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
/* List of code objects loaded into the inferior. */
|
2023-10-06 04:43:20 +08:00
|
|
|
std::vector<rocm_so> solib_list;
|
2023-07-20 18:15:50 +08:00
|
|
|
|
|
|
|
/* Cache of opened FD in the inferior. */
|
|
|
|
rocm_solib_fd_cache fd_cache;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/* Per-inferior data key. */
|
|
|
|
static const registry<inferior>::key<solib_info> rocm_solib_data;
|
|
|
|
|
2024-02-06 04:18:34 +08:00
|
|
|
static solib_ops rocm_solib_ops;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
/* Fetch the solib_info data for INF. */
|
|
|
|
|
|
|
|
static struct solib_info *
|
|
|
|
get_solib_info (inferior *inf)
|
|
|
|
{
|
|
|
|
solib_info *info = rocm_solib_data.get (inf);
|
|
|
|
|
|
|
|
if (info == nullptr)
|
2023-07-20 18:15:50 +08:00
|
|
|
info = rocm_solib_data.emplace (inf, inf);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
return info;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Relocate section addresses. */
|
|
|
|
|
|
|
|
static void
|
2024-02-06 04:18:33 +08:00
|
|
|
rocm_solib_relocate_section_addresses (solib &so,
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
struct target_section *sec)
|
|
|
|
{
|
2023-10-11 00:01:08 +08:00
|
|
|
if (!is_amdgpu_arch (gdbarch_from_bfd (so.abfd.get ())))
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
svr4_so_ops.relocate_section_addresses (so, sec);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-10-11 02:47:13 +08:00
|
|
|
auto *li = gdb::checked_static_cast<lm_info_svr4 *> (so.lm_info.get ());
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
sec->addr = sec->addr + li->l_addr;
|
|
|
|
sec->endaddr = sec->endaddr + li->l_addr;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rocm_update_solib_list ();
|
|
|
|
|
|
|
|
static void
|
|
|
|
rocm_solib_handle_event ()
|
|
|
|
{
|
|
|
|
/* Since we sit on top of svr4_so_ops, we might get called following an event
|
|
|
|
concerning host libraries. We must therefore forward the call. If the
|
|
|
|
event was for a ROCm code object, it will be a no-op. On the other hand,
|
|
|
|
if the event was for host libraries, rocm_update_solib_list will be
|
|
|
|
essentially be a no-op (it will reload the same code object list as was
|
|
|
|
previously loaded). */
|
|
|
|
svr4_so_ops.handle_event ();
|
|
|
|
|
|
|
|
rocm_update_solib_list ();
|
|
|
|
}
|
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
/* Create so_list objects from rocm_so objects in SOS. */
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2024-08-13 01:09:05 +08:00
|
|
|
static owning_intrusive_list<solib>
|
2023-10-06 04:43:20 +08:00
|
|
|
so_list_from_rocm_sos (const std::vector<rocm_so> &sos)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
2024-08-13 01:09:05 +08:00
|
|
|
owning_intrusive_list<solib> dst;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
for (const rocm_so &so : sos)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
2024-08-13 01:09:05 +08:00
|
|
|
auto &newobj = dst.emplace_back ();
|
2023-10-06 04:43:20 +08:00
|
|
|
|
2024-08-13 01:09:05 +08:00
|
|
|
newobj.lm_info = std::make_unique<lm_info_svr4> (*so.lm_info);
|
|
|
|
newobj.so_name = so.name;
|
|
|
|
newobj.so_original_name = so.unique_name;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return dst;
|
|
|
|
}
|
|
|
|
|
2024-02-06 04:18:33 +08:00
|
|
|
/* Build a list of `struct solib' objects describing the shared
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
objects currently loaded in the inferior. */
|
|
|
|
|
2024-08-13 01:09:05 +08:00
|
|
|
static owning_intrusive_list<solib>
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
rocm_solib_current_sos ()
|
|
|
|
{
|
|
|
|
/* First, retrieve the host-side shared library list. */
|
2024-08-13 01:09:05 +08:00
|
|
|
owning_intrusive_list<solib> sos = svr4_so_ops.current_sos ();
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
/* Then, the device-side shared library list. */
|
2023-10-06 04:43:20 +08:00
|
|
|
std::vector<rocm_so> &dev_sos = get_solib_info (current_inferior ())->solib_list;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
if (dev_sos.empty ())
|
2023-10-19 22:55:38 +08:00
|
|
|
return sos;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2024-08-13 01:09:05 +08:00
|
|
|
owning_intrusive_list<solib> dev_so_list = so_list_from_rocm_sos (dev_sos);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-19 22:55:38 +08:00
|
|
|
if (sos.empty ())
|
2023-10-06 04:43:20 +08:00
|
|
|
return dev_so_list;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
/* Append our libraries to the end of the list. */
|
2023-10-19 22:55:38 +08:00
|
|
|
sos.splice (std::move (dev_so_list));
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-19 22:55:38 +08:00
|
|
|
return sos;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
/* Interface to interact with a ROCm code object stream. */
|
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
struct rocm_code_object_stream : public gdb_bfd_iovec_base
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
DISABLE_COPY_AND_ASSIGN (rocm_code_object_stream);
|
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
int stat (bfd *abfd, struct stat *sb) final override;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
~rocm_code_object_stream () override = default;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
protected:
|
|
|
|
rocm_code_object_stream () = default;
|
|
|
|
|
|
|
|
/* Return the size of the object file, or -1 if the size cannot be
|
|
|
|
determined.
|
|
|
|
|
|
|
|
This is a helper function for stat. */
|
|
|
|
virtual LONGEST size () = 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
int
|
2023-08-24 23:23:46 +08:00
|
|
|
rocm_code_object_stream::stat (bfd *, struct stat *sb)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
const LONGEST size = this->size ();
|
|
|
|
if (size == -1)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
memset (sb, '\0', sizeof (struct stat));
|
|
|
|
sb->st_size = size;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Interface to a ROCm object stream which is embedded in an ELF file
|
|
|
|
accessible to the debugger. */
|
|
|
|
|
|
|
|
struct rocm_code_object_stream_file final : rocm_code_object_stream
|
|
|
|
{
|
|
|
|
DISABLE_COPY_AND_ASSIGN (rocm_code_object_stream_file);
|
|
|
|
|
2023-07-20 18:15:50 +08:00
|
|
|
rocm_code_object_stream_file (inferior *inf, int fd, ULONGEST offset,
|
|
|
|
ULONGEST size);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
file_ptr read (bfd *abfd, void *buf, file_ptr size,
|
|
|
|
file_ptr offset) override;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
LONGEST size () override;
|
|
|
|
|
|
|
|
~rocm_code_object_stream_file () override;
|
|
|
|
|
|
|
|
protected:
|
|
|
|
|
2023-07-20 18:15:50 +08:00
|
|
|
/* The inferior owning this code object stream. */
|
|
|
|
inferior *m_inf;
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
/* The target file descriptor for this stream. */
|
|
|
|
int m_fd;
|
|
|
|
|
|
|
|
/* The offset of the ELF file image in the target file. */
|
|
|
|
ULONGEST m_offset;
|
|
|
|
|
|
|
|
/* The size of the ELF file image. The value 0 means that it was
|
|
|
|
unspecified in the URI descriptor. */
|
|
|
|
ULONGEST m_size;
|
|
|
|
};
|
|
|
|
|
|
|
|
rocm_code_object_stream_file::rocm_code_object_stream_file
|
2023-07-20 18:15:50 +08:00
|
|
|
(inferior *inf, int fd, ULONGEST offset, ULONGEST size)
|
|
|
|
: m_inf (inf), m_fd (fd), m_offset (offset), m_size (size)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
file_ptr
|
2023-08-24 23:23:46 +08:00
|
|
|
rocm_code_object_stream_file::read (bfd *, void *buf, file_ptr size,
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
file_ptr offset)
|
|
|
|
{
|
|
|
|
fileio_error target_errno;
|
|
|
|
file_ptr nbytes = 0;
|
|
|
|
while (size > 0)
|
|
|
|
{
|
|
|
|
QUIT;
|
|
|
|
|
|
|
|
file_ptr bytes_read
|
|
|
|
= target_fileio_pread (m_fd, static_cast<gdb_byte *> (buf) + nbytes,
|
|
|
|
size, m_offset + offset + nbytes,
|
|
|
|
&target_errno);
|
|
|
|
|
|
|
|
if (bytes_read == 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (bytes_read < 0)
|
|
|
|
{
|
|
|
|
errno = fileio_error_to_host (target_errno);
|
|
|
|
bfd_set_error (bfd_error_system_call);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
nbytes += bytes_read;
|
|
|
|
size -= bytes_read;
|
|
|
|
}
|
|
|
|
|
|
|
|
return nbytes;
|
|
|
|
}
|
|
|
|
|
|
|
|
LONGEST
|
|
|
|
rocm_code_object_stream_file::size ()
|
|
|
|
{
|
|
|
|
if (m_size == 0)
|
|
|
|
{
|
|
|
|
fileio_error target_errno;
|
|
|
|
struct stat stat;
|
|
|
|
if (target_fileio_fstat (m_fd, &stat, &target_errno) < 0)
|
|
|
|
{
|
|
|
|
errno = fileio_error_to_host (target_errno);
|
|
|
|
bfd_set_error (bfd_error_system_call);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check that the offset is valid. */
|
|
|
|
if (m_offset >= stat.st_size)
|
|
|
|
{
|
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
m_size = stat.st_size - m_offset;
|
|
|
|
}
|
|
|
|
|
|
|
|
return m_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
rocm_code_object_stream_file::~rocm_code_object_stream_file ()
|
|
|
|
{
|
2023-07-20 18:15:50 +08:00
|
|
|
auto info = get_solib_info (m_inf);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
fileio_error target_errno;
|
2023-07-20 18:15:50 +08:00
|
|
|
if (info->fd_cache.close (m_fd, &target_errno) != 0)
|
|
|
|
warning (_("Failed to close solib: %s"),
|
|
|
|
strerror (fileio_error_to_host (target_errno)));
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Interface to a code object which lives in the inferior's memory. */
|
|
|
|
|
|
|
|
struct rocm_code_object_stream_memory final : public rocm_code_object_stream
|
|
|
|
{
|
|
|
|
DISABLE_COPY_AND_ASSIGN (rocm_code_object_stream_memory);
|
|
|
|
|
|
|
|
rocm_code_object_stream_memory (gdb::byte_vector buffer);
|
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
file_ptr read (bfd *abfd, void *buf, file_ptr size,
|
|
|
|
file_ptr offset) override;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
protected:
|
|
|
|
|
|
|
|
/* Snapshot of the original ELF image taken during load. This is done to
|
|
|
|
support the situation where an inferior uses an in-memory image, and
|
|
|
|
releases or re-uses this memory before GDB is done using it. */
|
|
|
|
gdb::byte_vector m_objfile_image;
|
|
|
|
|
|
|
|
LONGEST size () override
|
|
|
|
{
|
|
|
|
return m_objfile_image.size ();
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
rocm_code_object_stream_memory::rocm_code_object_stream_memory
|
|
|
|
(gdb::byte_vector buffer)
|
|
|
|
: m_objfile_image (std::move (buffer))
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
file_ptr
|
2023-08-24 23:23:46 +08:00
|
|
|
rocm_code_object_stream_memory::read (bfd *, void *buf, file_ptr size,
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
file_ptr offset)
|
|
|
|
{
|
|
|
|
if (size > m_objfile_image.size () - offset)
|
|
|
|
size = m_objfile_image.size () - offset;
|
|
|
|
|
|
|
|
memcpy (buf, m_objfile_image.data () + offset, size);
|
|
|
|
return size;
|
|
|
|
}
|
|
|
|
|
|
|
|
} /* anonymous namespace */
|
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
static gdb_bfd_iovec_base *
|
|
|
|
rocm_bfd_iovec_open (bfd *abfd, inferior *inferior)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
2023-10-13 18:17:02 +08:00
|
|
|
std::string_view uri (bfd_get_filename (abfd));
|
|
|
|
std::string_view protocol_delim = "://";
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
size_t protocol_end = uri.find (protocol_delim);
|
2023-10-13 18:54:46 +08:00
|
|
|
std::string protocol (uri.substr (0, protocol_end));
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
protocol_end += protocol_delim.length ();
|
|
|
|
|
|
|
|
std::transform (protocol.begin (), protocol.end (), protocol.begin (),
|
|
|
|
[] (unsigned char c) { return std::tolower (c); });
|
|
|
|
|
2023-10-13 18:17:02 +08:00
|
|
|
std::string_view path;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
size_t path_end = uri.find_first_of ("#?", protocol_end);
|
|
|
|
if (path_end != std::string::npos)
|
|
|
|
path = uri.substr (protocol_end, path_end++ - protocol_end);
|
|
|
|
else
|
|
|
|
path = uri.substr (protocol_end);
|
|
|
|
|
|
|
|
/* %-decode the string. */
|
|
|
|
std::string decoded_path;
|
|
|
|
decoded_path.reserve (path.length ());
|
|
|
|
for (size_t i = 0; i < path.length (); ++i)
|
|
|
|
if (path[i] == '%'
|
|
|
|
&& i < path.length () - 2
|
|
|
|
&& std::isxdigit (path[i + 1])
|
|
|
|
&& std::isxdigit (path[i + 2]))
|
|
|
|
{
|
2023-10-13 18:17:02 +08:00
|
|
|
std::string_view hex_digits = path.substr (i + 1, 2);
|
2023-10-13 18:54:46 +08:00
|
|
|
decoded_path += std::stoi (std::string (hex_digits), 0, 16);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
i += 2;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
decoded_path += path[i];
|
|
|
|
|
|
|
|
/* Tokenize the query/fragment. */
|
2023-10-13 18:17:02 +08:00
|
|
|
std::vector<std::string_view> tokens;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
size_t pos, last = path_end;
|
|
|
|
while ((pos = uri.find ('&', last)) != std::string::npos)
|
|
|
|
{
|
|
|
|
tokens.emplace_back (uri.substr (last, pos - last));
|
|
|
|
last = pos + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last != std::string::npos)
|
|
|
|
tokens.emplace_back (uri.substr (last));
|
|
|
|
|
|
|
|
/* Create a tag-value map from the tokenized query/fragment. */
|
2023-10-13 18:17:02 +08:00
|
|
|
std::unordered_map<std::string_view, std::string_view,
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
gdb::string_view_hash> params;
|
2023-10-13 18:17:02 +08:00
|
|
|
for (std::string_view token : tokens)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
size_t delim = token.find ('=');
|
|
|
|
if (delim != std::string::npos)
|
|
|
|
{
|
2023-10-13 18:17:02 +08:00
|
|
|
std::string_view tag = token.substr (0, delim);
|
|
|
|
std::string_view val = token.substr (delim + 1);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
params.emplace (tag, val);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
try
|
|
|
|
{
|
|
|
|
ULONGEST offset = 0;
|
|
|
|
ULONGEST size = 0;
|
|
|
|
|
2023-10-13 18:17:02 +08:00
|
|
|
auto try_strtoulst = [] (std::string_view v)
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
{
|
|
|
|
errno = 0;
|
|
|
|
ULONGEST value = strtoulst (v.data (), nullptr, 0);
|
|
|
|
if (errno != 0)
|
|
|
|
{
|
|
|
|
/* The actual message doesn't matter, the exception is caught
|
2023-03-02 04:21:53 +08:00
|
|
|
below, transformed in a BFD error, and the message is lost. */
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
error (_("Failed to parse integer."));
|
|
|
|
}
|
|
|
|
|
|
|
|
return value;
|
|
|
|
};
|
|
|
|
|
|
|
|
auto offset_it = params.find ("offset");
|
|
|
|
if (offset_it != params.end ())
|
|
|
|
offset = try_strtoulst (offset_it->second);
|
|
|
|
|
|
|
|
auto size_it = params.find ("size");
|
|
|
|
if (size_it != params.end ())
|
|
|
|
{
|
|
|
|
size = try_strtoulst (size_it->second);
|
|
|
|
if (size == 0)
|
|
|
|
error (_("Invalid size value"));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (protocol == "file")
|
|
|
|
{
|
2023-07-20 18:15:50 +08:00
|
|
|
auto info = get_solib_info (inferior);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
fileio_error target_errno;
|
2023-07-20 18:15:50 +08:00
|
|
|
int fd = info->fd_cache.open (decoded_path, &target_errno);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
if (fd == -1)
|
|
|
|
{
|
|
|
|
errno = fileio_error_to_host (target_errno);
|
|
|
|
bfd_set_error (bfd_error_system_call);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
2023-07-20 18:15:50 +08:00
|
|
|
return new rocm_code_object_stream_file (inferior, fd, offset,
|
|
|
|
size);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (protocol == "memory")
|
|
|
|
{
|
|
|
|
ULONGEST pid = try_strtoulst (path);
|
|
|
|
if (pid != inferior->pid)
|
|
|
|
{
|
|
|
|
warning (_("`%s': code object is from another inferior"),
|
2023-10-13 18:54:46 +08:00
|
|
|
std::string (uri).c_str ());
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
gdb::byte_vector buffer (size);
|
|
|
|
if (target_read_memory (offset, buffer.data (), size) != 0)
|
|
|
|
{
|
|
|
|
warning (_("Failed to copy the code object from the inferior"));
|
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
return new rocm_code_object_stream_memory (std::move (buffer));
|
|
|
|
}
|
|
|
|
|
|
|
|
warning (_("`%s': protocol not supported: %s"),
|
2023-10-13 18:54:46 +08:00
|
|
|
std::string (uri).c_str (), protocol.c_str ());
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
catch (const gdb_exception_quit &ex)
|
|
|
|
{
|
|
|
|
set_quit_flag ();
|
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
catch (const gdb_exception &ex)
|
|
|
|
{
|
|
|
|
bfd_set_error (bfd_error_bad_value);
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static gdb_bfd_ref_ptr
|
|
|
|
rocm_solib_bfd_open (const char *pathname)
|
|
|
|
{
|
|
|
|
/* Handle regular files with SVR4 open. */
|
|
|
|
if (strstr (pathname, "://") == nullptr)
|
|
|
|
return svr4_so_ops.bfd_open (pathname);
|
|
|
|
|
2023-08-24 23:23:46 +08:00
|
|
|
auto open = [] (bfd *nbfd) -> gdb_bfd_iovec_base *
|
|
|
|
{
|
|
|
|
return rocm_bfd_iovec_open (nbfd, current_inferior ());
|
|
|
|
};
|
|
|
|
|
|
|
|
gdb_bfd_ref_ptr abfd = gdb_bfd_openr_iovec (pathname, "elf64-amdgcn", open);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
if (abfd == nullptr)
|
|
|
|
error (_("Could not open `%s' as an executable file: %s"), pathname,
|
|
|
|
bfd_errmsg (bfd_get_error ()));
|
|
|
|
|
|
|
|
/* Check bfd format. */
|
|
|
|
if (!bfd_check_format (abfd.get (), bfd_object))
|
|
|
|
error (_("`%s': not in executable format: %s"),
|
|
|
|
bfd_get_filename (abfd.get ()), bfd_errmsg (bfd_get_error ()));
|
|
|
|
|
|
|
|
unsigned char osabi = elf_elfheader (abfd)->e_ident[EI_OSABI];
|
|
|
|
unsigned char osabiversion = elf_elfheader (abfd)->e_ident[EI_ABIVERSION];
|
|
|
|
|
|
|
|
/* Check that the code object is using the HSA OS ABI. */
|
|
|
|
if (osabi != ELFOSABI_AMDGPU_HSA)
|
|
|
|
error (_("`%s': ELF file OS ABI is not supported (%d)."),
|
|
|
|
bfd_get_filename (abfd.get ()), osabi);
|
|
|
|
|
|
|
|
/* We support HSA code objects V3 and greater. */
|
|
|
|
if (osabiversion < ELFABIVERSION_AMDGPU_HSA_V3)
|
|
|
|
error (_("`%s': ELF file HSA OS ABI version is not supported (%d)."),
|
|
|
|
bfd_get_filename (abfd.get ()), osabiversion);
|
|
|
|
|
gdb/solib-rocm: Detect SO for unsupported AMDGPU device
It is possible to debug a process which uses unsupported AMDGPU devices.
In such scenario, we can still use librocm-dbgapi.so to attach to the
process and complete the runtime activation sequence.
However, when listing shared objects loaded on the AMDGPU devices, we
might list SOs loaded on the unsupported devices. If such SO is
seen, one of two things can happen.
First, if the arch of this device is unknown to BFD,
'gdbarch_find_by_info (gdbarch_info info)' will return the gdbarch
matching default_bfd_arch. As a result,
rocm_solib_relocate_section_addresses will delegate the relocation
operation to svr4_so_ops.relocate_section_addresses, but this makes no
sense: this code object was not loaded by the system loader.
The second case is if BFD knows the micro-architecture of the device,
but dbgapi does not support it. In such case, gdbarch_info_fill will
successfully identify an amdgcn architecture (bfd_arch_amdgcn). From
there, gdbarch_find_by_info calls amdgpu_gdbarch_init which will fail to
query arch specific details from dbgapi and subsequently fail to
initialize the gdbarch object. As a result, gdbarch_find_by_info
returns nullptr, which will down the line cause some "gdb_assert
(gdbarch != nullptr)" assertion failures.
This patch proposes to add a check in rocm_solib_bfd_open to ensure that
the architecture associated with the code object to open is fully
supported by both BFD and amd-dbgapi, and error-out otherwise.
Change-Id: Ica97ab7cba45e4944b77d3080c54c1038aaeda54
Approved-By: Pedro Alves <pedro@palves.net>
2023-07-26 18:30:56 +08:00
|
|
|
/* For GDB to be able to use this solib, the exact AMDGPU processor type
|
|
|
|
must be supported by both BFD and the amd-dbgapi library. */
|
|
|
|
const unsigned char gfx_arch
|
|
|
|
= elf_elfheader (abfd)->e_flags & EF_AMDGPU_MACH ;
|
|
|
|
const bfd_arch_info_type *bfd_arch_info
|
|
|
|
= bfd_lookup_arch (bfd_arch_amdgcn, gfx_arch);
|
|
|
|
|
|
|
|
amd_dbgapi_architecture_id_t architecture_id;
|
|
|
|
amd_dbgapi_status_t dbgapi_query_arch
|
|
|
|
= amd_dbgapi_get_architecture (gfx_arch, &architecture_id);
|
|
|
|
|
|
|
|
if (dbgapi_query_arch != AMD_DBGAPI_STATUS_SUCCESS
|
|
|
|
|| bfd_arch_info == nullptr)
|
|
|
|
{
|
|
|
|
if (dbgapi_query_arch != AMD_DBGAPI_STATUS_SUCCESS
|
|
|
|
&& bfd_arch_info == nullptr)
|
|
|
|
{
|
|
|
|
/* Neither of the libraries knows about this arch, so we cannot
|
|
|
|
provide a human readable name for it. */
|
|
|
|
error (_("'%s': AMDGCN architecture %#02x is not supported."),
|
|
|
|
bfd_get_filename (abfd.get ()), gfx_arch);
|
|
|
|
}
|
|
|
|
else if (dbgapi_query_arch != AMD_DBGAPI_STATUS_SUCCESS)
|
|
|
|
{
|
|
|
|
gdb_assert (bfd_arch_info != nullptr);
|
|
|
|
error (_("'%s': AMDGCN architecture %s not supported by "
|
|
|
|
"amd-dbgapi."),
|
|
|
|
bfd_get_filename (abfd.get ()),
|
|
|
|
bfd_arch_info->printable_name);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
gdb_assert (dbgapi_query_arch == AMD_DBGAPI_STATUS_SUCCESS);
|
|
|
|
char *arch_name;
|
|
|
|
if (amd_dbgapi_architecture_get_info
|
|
|
|
(architecture_id, AMD_DBGAPI_ARCHITECTURE_INFO_NAME,
|
|
|
|
sizeof (arch_name), &arch_name) != AMD_DBGAPI_STATUS_SUCCESS)
|
|
|
|
error ("amd_dbgapi_architecture_get_info call failed for arch "
|
|
|
|
"%#02x.", gfx_arch);
|
|
|
|
gdb::unique_xmalloc_ptr<char> arch_name_cleaner (arch_name);
|
|
|
|
|
|
|
|
error (_("'%s': AMDGCN architecture %s not supported."),
|
|
|
|
bfd_get_filename (abfd.get ()),
|
|
|
|
arch_name);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
gdb_assert (gdbarch_from_bfd (abfd.get ()) != nullptr);
|
|
|
|
gdb_assert (is_amdgpu_arch (gdbarch_from_bfd (abfd.get ())));
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
return abfd;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
rocm_solib_create_inferior_hook (int from_tty)
|
|
|
|
{
|
2023-10-06 04:43:20 +08:00
|
|
|
get_solib_info (current_inferior ())->solib_list.clear ();
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
svr4_so_ops.solib_create_inferior_hook (from_tty);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
rocm_update_solib_list ()
|
|
|
|
{
|
|
|
|
inferior *inf = current_inferior ();
|
|
|
|
|
|
|
|
amd_dbgapi_process_id_t process_id = get_amd_dbgapi_process_id (inf);
|
|
|
|
if (process_id.handle == AMD_DBGAPI_PROCESS_NONE.handle)
|
|
|
|
return;
|
|
|
|
|
|
|
|
solib_info *info = get_solib_info (inf);
|
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
info->solib_list.clear ();
|
|
|
|
std::vector<rocm_so> &sos = info->solib_list;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
|
|
|
amd_dbgapi_code_object_id_t *code_object_list;
|
|
|
|
size_t count;
|
|
|
|
|
|
|
|
amd_dbgapi_status_t status
|
|
|
|
= amd_dbgapi_process_code_object_list (process_id, &count,
|
|
|
|
&code_object_list, nullptr);
|
|
|
|
if (status != AMD_DBGAPI_STATUS_SUCCESS)
|
|
|
|
{
|
|
|
|
warning (_("amd_dbgapi_process_code_object_list failed (%s)"),
|
|
|
|
get_status_string (status));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (size_t i = 0; i < count; ++i)
|
|
|
|
{
|
|
|
|
CORE_ADDR l_addr;
|
|
|
|
char *uri_bytes;
|
|
|
|
|
|
|
|
status = amd_dbgapi_code_object_get_info
|
|
|
|
(code_object_list[i], AMD_DBGAPI_CODE_OBJECT_INFO_LOAD_ADDRESS,
|
|
|
|
sizeof (l_addr), &l_addr);
|
|
|
|
if (status != AMD_DBGAPI_STATUS_SUCCESS)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
status = amd_dbgapi_code_object_get_info
|
|
|
|
(code_object_list[i], AMD_DBGAPI_CODE_OBJECT_INFO_URI_NAME,
|
|
|
|
sizeof (uri_bytes), &uri_bytes);
|
|
|
|
if (status != AMD_DBGAPI_STATUS_SUCCESS)
|
|
|
|
continue;
|
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
gdb::unique_xmalloc_ptr<char> uri_bytes_holder (uri_bytes);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-09-14 19:13:24 +08:00
|
|
|
lm_info_svr4_up li = std::make_unique<lm_info_svr4> ();
|
2023-10-06 04:43:20 +08:00
|
|
|
li->l_addr = l_addr;
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
/* Generate a unique name so that code objects with the same URI but
|
|
|
|
different load addresses are seen by gdb core as different shared
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
objects. */
|
2023-10-06 04:43:20 +08:00
|
|
|
std::string unique_name
|
|
|
|
= string_printf ("code_object_%ld", code_object_list[i].handle);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
|
2023-10-06 04:43:20 +08:00
|
|
|
sos.emplace_back (uri_bytes, std::move (unique_name), std::move (li));
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
xfree (code_object_list);
|
|
|
|
|
|
|
|
if (rocm_solib_ops.current_sos == NULL)
|
|
|
|
{
|
|
|
|
/* Override what we need to. */
|
|
|
|
rocm_solib_ops = svr4_so_ops;
|
|
|
|
rocm_solib_ops.current_sos = rocm_solib_current_sos;
|
|
|
|
rocm_solib_ops.solib_create_inferior_hook
|
|
|
|
= rocm_solib_create_inferior_hook;
|
|
|
|
rocm_solib_ops.bfd_open = rocm_solib_bfd_open;
|
|
|
|
rocm_solib_ops.relocate_section_addresses
|
|
|
|
= rocm_solib_relocate_section_addresses;
|
|
|
|
rocm_solib_ops.handle_event = rocm_solib_handle_event;
|
|
|
|
|
|
|
|
/* Engage the ROCm so_ops. */
|
2023-09-30 02:24:35 +08:00
|
|
|
set_gdbarch_so_ops (current_inferior ()->arch (), &rocm_solib_ops);
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
rocm_solib_target_inferior_created (inferior *inf)
|
|
|
|
{
|
2023-10-06 04:43:20 +08:00
|
|
|
get_solib_info (inf)->solib_list.clear ();
|
|
|
|
|
gdb: initial support for ROCm platform (AMDGPU) debugging
This patch adds the foundation for GDB to be able to debug programs
offloaded to AMD GPUs using the AMD ROCm platform [1]. The latest
public release of the ROCm release at the time of writing is 5.4, so
this is what this patch targets.
The ROCm platform allows host programs to schedule bits of code for
execution on GPUs or similar accelerators. The programs running on GPUs
are typically referred to as `kernels` (not related to operating system
kernels).
Programs offloaded with the AMD ROCm platform can be written in the HIP
language [2], OpenCL and OpenMP, but we're going to focus on HIP here.
The HIP language consists of a C++ Runtime API and kernel language.
Here's an example of a very simple HIP program:
#include "hip/hip_runtime.h"
#include <cassert>
__global__ void
do_an_addition (int a, int b, int *out)
{
*out = a + b;
}
int
main ()
{
int *result_ptr, result;
/* Allocate memory for the device to write the result to. */
hipError_t error = hipMalloc (&result_ptr, sizeof (int));
assert (error == hipSuccess);
/* Run `do_an_addition` on one workgroup containing one work item. */
do_an_addition<<<dim3(1), dim3(1), 0, 0>>> (1, 2, result_ptr);
/* Copy result from device to host. Note that this acts as a synchronization
point, waiting for the kernel dispatch to complete. */
error = hipMemcpyDtoH (&result, result_ptr, sizeof (int));
assert (error == hipSuccess);
printf ("result is %d\n", result);
assert (result == 3);
return 0;
}
This program can be compiled with:
$ hipcc simple.cpp -g -O0 -o simple
... where `hipcc` is the HIP compiler, shipped with ROCm releases. This
generates an ELF binary for the host architecture, containing another
ELF binary with the device code. The ELF for the device can be
inspected with:
$ roc-obj-ls simple
1 host-x86_64-unknown-linux file://simple#offset=8192&size=0
1 hipv4-amdgcn-amd-amdhsa--gfx906 file://simple#offset=8192&size=34216
$ roc-obj-extract 'file://simple#offset=8192&size=34216'
$ file simple-offset8192-size34216.co
simple-offset8192-size34216.co: ELF 64-bit LSB shared object, *unknown arch 0xe0* version 1, dynamically linked, with debug_info, not stripped
^
amcgcn architecture that my `file` doesn't know about ----´
Running the program gives the very unimpressive result:
$ ./simple
result is 3
While running, this host program has copied the device program into the
GPU's memory and spawned an execution thread on it. The goal of this
GDB port is to let the user debug host threads and these GPU threads
simultaneously. Here's a sample session using a GDB with this patch
applied:
$ ./gdb -q -nx --data-directory=data-directory ./simple
Reading symbols from ./simple...
(gdb) break do_an_addition
Function "do_an_addition" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (do_an_addition) pending.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5db7640 (LWP 1082911)]
[New Thread 0x7ffef53ff640 (LWP 1082913)]
[Thread 0x7ffef53ff640 (LWP 1082913) exited]
[New Thread 0x7ffdecb53640 (LWP 1083185)]
[New Thread 0x7ffff54bf640 (LWP 1083186)]
[Thread 0x7ffdecb53640 (LWP 1083185) exited]
[Switching to AMDGPU Wave 2:2:1:1 (0,0,0)/0]
Thread 6 hit Breakpoint 1, do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
24 *out = a + b;
(gdb) info inferiors
Num Description Connection Executable
* 1 process 1082907 1 (native) /home/smarchi/build/binutils-gdb-amdgpu/gdb/simple
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff5dc9240 (LWP 1082907) "simple" 0x00007ffff5e9410b in ?? () from /opt/rocm-5.4.0/lib/libhsa-runtime64.so.1
2 Thread 0x7ffff5db7640 (LWP 1082911) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff54bf640 (LWP 1083186) "simple" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
* 6 AMDGPU Wave 2:2:1:1 (0,0,0)/0 do_an_addition (
a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) bt
Python Exception <class 'gdb.error'>: Unhandled dwarf expression opcode 0xe1
#0 do_an_addition (a=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
b=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>,
out=<error reading variable: DWARF-2 expression error: `DW_OP_regx' operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at simple.cpp:24
(gdb) continue
Continuing.
result is 3
warning: Temporarily disabling breakpoints for unloaded shared library "file:///home/smarchi/build/binutils-gdb-amdgpu/gdb/simple#offset=8192&size=67208"
[Thread 0x7ffff54bf640 (LWP 1083186) exited]
[Thread 0x7ffff5db7640 (LWP 1082911) exited]
[Inferior 1 (process 1082907) exited normally]
One thing to notice is the host and GPU threads appearing under
the same inferior. This is a design goal for us, as programmers tend to
think of the threads running on the GPU as part of the same program as
the host threads, so showing them in the same inferior in GDB seems
natural. Also, the host and GPU threads share a global memory space,
which fits the inferior model.
Another thing to notice is the error messages when trying to read
variables or printing a backtrace. This is expected for the moment,
since the AMD GPU compiler produces some DWARF that uses some
non-standard extensions:
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
There were already some patches posted by Zoran Zaric earlier to make
GDB support these extensions:
https://inbox.sourceware.org/gdb-patches/20211105113849.118800-1-zoran.zaric@amd.com/
We think it's better to get the basic support for AMD GPU in first,
which will then give a better justification for GDB to support these
extensions.
GPU threads are named `AMDGPU Wave`: a wave is essentially a hardware
thread using the SIMT (single-instruction, multiple-threads) [3]
execution model.
GDB uses the amd-dbgapi library [4], included in the ROCm platform, for
a few things related to AMD GPU threads debugging. Different components
talk to the library, as show on the following diagram:
+---------------------------+ +-------------+ +------------------+
| GDB | amd-dbgapi target | <-> | AMD | | Linux kernel |
| +-------------------+ | Debugger | +--------+ |
| | amdgcn gdbarch | <-> | API | <=> | AMDGPU | |
| +-------------------+ | | | driver | |
| | solib-rocm | <-> | (dbgapi.so) | +--------+---------+
+---------------------------+ +-------------+
- The amd-dbgapi target is a target_ops implementation used to control
execution of GPU threads. While the debugging of host threads works
by using the ptrace / wait Linux kernel interface (as usual), control
of GPU threads is done through a special interface (dubbed `kfd`)
exposed by the `amdgpu` Linux kernel module. GDB doesn't interact
directly with `kfd`, but instead goes through the amd-dbgapi library
(AMD Debugger API on the diagram).
Since it provides execution control, the amd-dbgapi target should
normally be a process_stratum_target, not just a target_ops. More
on that later.
- The amdgcn gdbarch (describing the hardware architecture of the GPU
execution units) offloads some requests to the amd-dbgapi library,
so that knowledge about the various architectures doesn't need to be
duplicated and baked in GDB. This is for example for things like
the list of registers.
- The solib-rocm component is an solib provider that fetches the list of
code objects loaded on the device from the amd-dbgapi library, and
makes GDB read their symbols. This is very similar to other solib
providers that handle shared libraries, except that here the shared
libraries are the pieces of code loaded on the device.
Given that Linux host threads are managed by the linux-nat target, and
the GPU threads are managed by the amd-dbgapi target, having all threads
appear in the same inferior requires the two targets to be in that
inferior's target stack. However, there can only be one
process_stratum_target in a given target stack, since there can be only
one target per slot. To achieve it, we therefore resort the hack^W
solution of placing the amd-dbgapi target in the arch_stratum slot of
the target stack, on top of the linux-nat target. Doing so allows the
amd-dbgapi target to intercept target calls and handle them if they
concern GPU threads, and offload to beneath otherwise. See
amd_dbgapi_target::fetch_registers for a simple example:
void
amd_dbgapi_target::fetch_registers (struct regcache *regcache, int regno)
{
if (!ptid_is_gpu (regcache->ptid ()))
{
beneath ()->fetch_registers (regcache, regno);
return;
}
// handle it
}
ptids of GPU threads are crafted with the following pattern:
(pid, 1, wave id)
Where pid is the inferior's pid and "wave id" is the wave handle handed
to us by the amd-dbgapi library (in practice, a monotonically
incrementing integer). The idea is that on Linux systems, the
combination (pid != 1, lwp == 1) is not possible. lwp == 1 would always
belong to the init process, which would also have pid == 1 (and it's
improbable for the init process to offload work to the GPU and much less
for the user to debug it). We can therefore differentiate GPU and
non-GPU ptids this way. See ptid_is_gpu for more details.
Note that we believe that this scheme could break down in the context of
containers, where the initial process executed in a container has pid 1
(in its own pid namespace). For instance, if you were to execute a ROCm
program in a container, then spawn a GDB in that container and attach to
the process, it will likely not work. This is a known limitation. A
workaround for this is to have a dummy process (like a shell) fork and
execute the program of interest.
The amd-dbgapi target watches native inferiors, and "attaches" to them
using amd_dbgapi_process_attach, which gives it a notifier fd that is
registered in the event loop (see enable_amd_dbgapi). Note that this
isn't the same "attach" as in PTRACE_ATTACH, but being ptrace-attached
is a precondition for amd_dbgapi_process_attach to work. When the
debugged process enables the ROCm runtime, the amd-dbgapi target gets
notified through that fd, and pushes itself on the target stack of the
inferior. The amd-dbgapi target is then able to intercept target_ops
calls. If the debugged process disables the ROCm runtime, the
amd-dbgapi target unpushes itself from the target stack.
This way, the amd-dbgapi target's footprint stays minimal when debugging
a process that doesn't use the AMD ROCm platform, it does not intercept
target calls.
The amd-dbgapi library is found using pkg-config. Since enabling
support for the amdgpu architecture (amdgpu-tdep.c) depends on the
amd-dbgapi library being present, we have the following logic for
the interaction with --target and --enable-targets:
- if the user explicitly asks for amdgcn support with
--target=amdgcn-*-* or --enable-targets=amdgcn-*-*, we probe for
the amd-dbgapi and fail if not found
- if the user uses --enable-targets=all, we probe for amd-dbgapi,
enable amdgcn support if found, disable amdgcn support if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=yes,
we probe for amd-dbgapi, enable amdgcn if found and fail if not found
- if the user uses --enable-targets=all and --with-amd-dbgapi=no,
we do not probe for amd-dbgapi, disable amdgcn support
- otherwise, amd-dbgapi is not probed for and support for amdgcn is not
enabled
Finally, a simple test is included. It only tests hitting a breakpoint
in device code and resuming execution, pretty much like the example
shown above.
[1] https://docs.amd.com/category/ROCm_v5.4
[2] https://docs.amd.com/bundle/HIP-Programming-Guide-v5.4
[3] https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
[4] https://docs.amd.com/bundle/ROCDebugger-API-Guide-v5.4
Change-Id: I591edca98b8927b1e49e4b0abe4e304765fed9ee
Co-Authored-By: Zoran Zaric <zoran.zaric@amd.com>
Co-Authored-By: Laurent Morichetti <laurent.morichetti@amd.com>
Co-Authored-By: Tony Tye <Tony.Tye@amd.com>
Co-Authored-By: Lancelot SIX <lancelot.six@amd.com>
Co-Authored-By: Pedro Alves <pedro@palves.net>
2023-01-04 04:07:07 +08:00
|
|
|
rocm_update_solib_list ();
|
|
|
|
|
|
|
|
/* Force GDB to reload the solibs. */
|
|
|
|
current_inferior ()->pspace->clear_solib_cache ();
|
|
|
|
solib_add (nullptr, 0, auto_solib_add);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* -Wmissing-prototypes */
|
|
|
|
extern initialize_file_ftype _initialize_rocm_solib;
|
|
|
|
|
|
|
|
void
|
|
|
|
_initialize_rocm_solib ()
|
|
|
|
{
|
|
|
|
/* The dependency on the amd-dbgapi exists because solib-rocm's
|
|
|
|
inferior_created observer needs amd-dbgapi to have attached the process,
|
|
|
|
which happens in amd_dbgapi_target's inferior_created observer. */
|
|
|
|
gdb::observers::inferior_created.attach
|
|
|
|
(rocm_solib_target_inferior_created,
|
|
|
|
"solib-rocm",
|
|
|
|
{ &get_amd_dbgapi_target_inferior_created_observer_token () });
|
|
|
|
}
|