Plasma Crash Course - DrKonqi
A while ago a colleague of mine asked about our crash infrastructure in Plasma and whether I could give an overview of it. That seemed like it would be useful to others as well, I thought. So here I am, telling you all about it!
Our crash infrastructure comprises a number of different components:
- KCrash: a KDE Framework performing crash interception and preparation for handover to…
- coredumpd: a systemd component performing process core collection and handover to…
- DrKonqi: a GUI for crashes, sending data to…
- Sentry: a web service and UI for tracking and presenting crashes for developers
We’ve already looked at KCrash and coredumpd. Now it is time to look at DrKonqi.
DrKonqi
DrKonqi is the UI that comes up when a crash happens. We’ll explore how it integrates with coredumpd and Sentry.
Crash Pickup
When I outlined the functionality of coredumpd, I mentioned that it starts an instance of systemd-coredump@.service. This not only allows the core dumping itself to be controlled by systemd’s resource control and configuration systems, it also means other systemd units can tie into the crash handling.
That is precisely what we do in DrKonqi. It installs drkonqi-coredump-processor@.service which, among other things, contains the rule:
WantedBy=systemd-coredump@.service
…meaning systemd will not only start systemd-coredump@unique_identifier but also a corresponding drkonqi-coredump-processor@unique_identifier. This is similar to how services start as part of the system boot sequence: they all are “wanted by” or “want” some other service, and that is how systemd knows what to start and when (I am simplifying here 😉). Note that unique_identifier is actually a systemd feature called “instances” — one systemd unit template can be instantiated multiple times this way.
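If you want to see this wiring on your own system, you can simply dump the unit file. Here is a minimal Python wrapper around the relevant systemctl call (running systemctl cat directly in a terminal works just as well):

```python
import subprocess

# Print the template unit file for the processor, including the [Install]
# section that carries the WantedBy=systemd-coredump@.service rule above.
# Equivalent to: systemctl cat drkonqi-coredump-processor@.service
print(subprocess.run(
    ["systemctl", "cat", "drkonqi-coredump-processor@.service"],
    capture_output=True, text=True, check=True,
).stdout)
```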
drkonqi-coredump-processor
When drkonqi-coredump-processor@unique_identifier runs, it first has some synchronization to do.
As a brief recap from the coredumpd post: coredumpd’s crash collection ends with writing a journald entry that contains all the collected data. DrKonqi needs this data, so we wait for it to appear in the journal.
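To give a feel for this, here is a minimal Python sketch of that waiting logic using the python-systemd bindings. The real processor is not this code; it merely illustrates the idea of matching on the coredump journal entry for the crashed PID and blocking until it appears:

```python
import sys

from systemd import journal  # python-systemd bindings

# systemd's well-known journal MESSAGE_ID for core dump entries.
COREDUMP_MESSAGE_ID = "fc2e22bc6ee647b6b90729ab34a250b1"

def wait_for_coredump_entry(pid: int) -> dict:
    reader = journal.Reader()
    # Only consider coredumpd entries for the PID we care about.
    reader.add_match(MESSAGE_ID=COREDUMP_MESSAGE_ID, COREDUMP_PID=str(pid))
    reader.seek_tail()
    reader.get_previous()  # position the cursor so we only see new entries
    while True:
        entry = reader.get_next()
        if entry:
            return entry  # contains COREDUMP_EXE, COREDUMP_FILENAME, and more
        reader.wait()  # block until journald appends something

if __name__ == "__main__":
    print(wait_for_coredump_entry(int(sys.argv[1])))
```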
Once the journal entry has arrived, we are good to go and will use systemd socket activation to start a helper in the relevant user’s session.
The way this works is a bit tricky: drkonqi-coredump-processor runs as root, but DrKonqi needs to be started as the user the crash happened to. To bridge this gap, a new service, drkonqi-coredump-launcher, comes into play.
drkonqi-coredump-launcher
Every user session has a drkonqi-coredump-launcher.socket systemd unit running that provides a socket. The processor connects to this socket (remember: it runs as root, so it is allowed to talk to the user’s socket). When that happens, an instance of drkonqi-coredump-launcher@.service is started (as the user) and the processor starts streaming the data from journald to the launcher.
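In spirit, the handover looks roughly like the following sketch. The socket path and the wire format are made up here; the real ones are defined by the launcher’s socket unit and DrKonqi’s internal protocol:

```python
import json
import socket

def stream_to_launcher(uid: int, crash_data: dict) -> None:
    # Hypothetical socket path; the real one is declared by
    # drkonqi-coredump-launcher.socket in the user session.
    path = f"/run/user/{uid}/drkonqi-coredump-launcher"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        # We run as root, so connecting to the user's socket is permitted.
        # systemd accepts the connection and spawns an instance of
        # drkonqi-coredump-launcher@.service as the user on the other end.
        sock.connect(path)
        # Stream the journald crash data over; JSON is an assumption here.
        sock.sendall(json.dumps(crash_data).encode())
```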
The crash has now traveled from the user, through the kernel, to system-level systemd services, and has finally arrived back in the actual user session.
Having been started by systemd and handed the crash data by the processor, drkonqi-coredump-launcher will now augment that data with the metadata KCrash originally saved to disk.
Once the crash data is complete, the launcher only needs to find a way to “pick up” the crash. This will usually be DrKonqi, but technically other types of crash pickup are also supported. Most notably, developers can set the environment variable KDE_COREDUMP_NOTIFY=1 to receive system notifications about crashes, with an easy way to open gdb for debugging. I’ve already written about this a while ago.
When ready, the launcher will start DrKonqi itself and pass over the complete metadata.
```
the crashed application
└── kernel
    └── systemd-coredump
        ├── systemd-coredump@unique_identifier.service
        └── drkonqi-coredump-processor@unique_identifier.service
            ├── drkonqi-coredump-launcher.socket
            └── drkonqi-coredump-launcher@unique_identifier.service
                └── drkonqi
```
What a journey!
Crash Processing
DrKonqi kicks off crash processing. This is hugely complicated and probably worth its own post. But let’s at least superficially explore what is going on.
The launcher has provided DrKonqi with a mountain of information, so it can now use systemd-coredump’s CLI, coredumpctl, to access the core dump and attach an instance of the GDB debugger to it.
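You can reproduce the gist of this by hand with coredumpctl. A small sketch, not DrKonqi’s actual code:

```python
import subprocess

def inspect_crash(pid: int) -> None:
    # Show the metadata coredumpd recorded for this crash.
    subprocess.run(["coredumpctl", "info", str(pid)], check=True)
    # Attach gdb to the stored core dump, much like DrKonqi does
    # programmatically through its GDB integration.
    subprocess.run(["coredumpctl", "debug", "--debugger=gdb", str(pid)],
                   check=True)
```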
GDB runs as a two-step automated process:
Preamble Step
As part of this automation, we run a service called the preamble: a Python program that interfaces with the Python API of GDB.
Its most important functionality is to create a well-structured backtrace that can be converted into a Sentry payload.
Sentry, for the most part, doesn’t ingest platform-specific core dumps or crash reports; instead it relies on an abstract payload format that is generated by so-called Sentry SDKs. DrKonqi essentially acts as such an SDK for us.
Once the preamble is done, the payload is transferred to DrKonqi and the next step can begin.
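To make this more concrete, here is a toy version of such a preamble: a script run inside GDB that walks the crashed thread’s stack through GDB’s Python API and emits frames shaped like Sentry’s stacktrace interface. The real preamble does far more (threads, registers, context), so treat this purely as a sketch:

```python
# Toy preamble: must run inside GDB (e.g. sourced from a gdb session),
# because the gdb module only exists in GDB's embedded Python interpreter.
import json

import gdb

def sentry_stacktrace() -> dict:
    frames = []
    frame = gdb.newest_frame()
    while frame is not None:
        sal = frame.find_sal()  # symbol-and-line info, if debug symbols exist
        frames.append({
            "function": frame.name() or "<unknown>",
            "filename": sal.symtab.filename if sal.symtab else None,
            "lineno": sal.line,
            "instruction_addr": hex(int(frame.pc())),
        })
        frame = frame.older()
    frames.reverse()  # Sentry expects the oldest frame first
    return {"stacktrace": {"frames": frames}}

print(json.dumps(sentry_stacktrace(), indent=2))
```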
Trace Step
After the preamble, DrKonqi executes an actual GDB trace (i.e. the literal backtrace command in gdb) to generate the developer output. This is also the trace that gets sent to KDE’s Bugzilla instance at bugs.kde.org if the user chooses to file a bug report. The reason this is separate from the already created backtrace is mostly historic. The trace is then routed through a text parser to figure out whether it is of sufficient quality; only when that is the case will DrKonqi allow filing a report in Bugzilla.
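The parser itself is fairly involved, but the kind of heuristic at play can be illustrated with a crude stand-in of my own invention (this is not DrKonqi’s real rating logic):

```python
def trace_is_useful(trace: str) -> bool:
    # A backtrace consisting mostly of "??" frames (missing debug
    # symbols) is not worth filing as a bug report.
    frames = [line for line in trace.splitlines() if line.startswith("#")]
    if not frames:
        return False
    unresolved = sum(1 for line in frames if "??" in line)
    return unresolved / len(frames) < 0.5
```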
Transmission
With all the trace data assembled, we just need to send it off to Bugzilla or Sentry, depending on what the user chose to do.
Bugzilla
The Bugzilla case simply sends the backtrace as one very long string to the Bugzilla API (albeit wrapped in some JSON).
Sentry
The Sentry case, on the other hand, requires more finesse. For starters, the Sentry code also works when offline. The trace and the optional user message get converted into a Sentry envelope tagged with a receiver address — a Sentry-specific URL for ingestion, so Sentry knows under which project to file the crash. The envelope is then written to ~/.cache/drkonqi/sentry-envelopes/. At this point, DrKonqi’s job is done; the actual transmission happens in an auxiliary service.
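For illustration, here is a minimal sketch of producing such an envelope. The newline-delimited framing follows Sentry’s documented envelope format; the file name and event contents are made up:

```python
import json
import uuid
from pathlib import Path

def write_envelope(event: dict, dsn: str) -> Path:
    # A Sentry envelope is newline-delimited JSON: an envelope header
    # first, then an item header and payload for each item.
    event_id = uuid.uuid4().hex
    payload = json.dumps({"event_id": event_id, **event})
    envelope = "\n".join([
        json.dumps({"event_id": event_id, "dsn": dsn}),  # the receiver address
        json.dumps({"type": "event", "length": len(payload.encode())}),
        payload,
    ])
    outdir = Path.home() / ".cache/drkonqi/sentry-envelopes"
    outdir.mkdir(parents=True, exist_ok=True)
    path = outdir / f"{event_id}.envelope"  # hypothetical file name
    path.write_text(envelope)
    return path
```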
Writing an envelope to disk triggers drkonqi-sentry-postman.service, which will attempt to send all pending envelopes to Sentry using the URL inside the payload. It also retries periodically while envelopes are pending, making sure crashes collected while offline still make it to Sentry eventually. Once sent successfully, the envelopes are archived in ~/.cache/drkonqi/sentry-sent-envelopes/.
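A sketch of the postman’s job, assuming the envelope header carries the DSN as above (the real retry policy and error handling are more careful than this):

```python
import json
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit

PENDING = Path.home() / ".cache/drkonqi/sentry-envelopes"
SENT = Path.home() / ".cache/drkonqi/sentry-sent-envelopes"

def deliver_pending_envelopes() -> None:
    SENT.mkdir(parents=True, exist_ok=True)
    for envelope in PENDING.glob("*"):
        # The DSN in the envelope header tells us where to submit the crash.
        header = json.loads(envelope.read_text().splitlines()[0])
        dsn = urlsplit(header["dsn"])  # e.g. https://<key>@sentry.example.org/<project-id>
        url = (f"{dsn.scheme}://{dsn.hostname}/api{dsn.path}/envelope/"
               f"?sentry_key={dsn.username}")
        request = urllib.request.Request(
            url,
            data=envelope.read_bytes(),
            headers={"Content-Type": "application/x-sentry-envelope"},
        )
        urllib.request.urlopen(request)  # raises on failure, leaving the envelope pending
        shutil.move(str(envelope), SENT / envelope.name)
```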
This concludes DrKonqi’s activity. There’s much more detail going on behind the scenes but it’s largely inconsequential to the overall flow. Next time we will look at the final piece in the puzzle — Sentry itself.