
The background is: fairly vanilla 16.1 containers running on Proxmox VE 7.1-4. Haven't messed with sshd. Have installed other stuff like crowdsec, web app framework, redis. Deployed from scratch

All sounds good so far ... (BTW crowdsec looks awesome - any more to say about your experiences with that?)

It happened to 4 containers simultaneously AFAIK, one is nginx & mariadb, the others are TKL core. Yes, proxmox local console to the rescue. It has only happened once so far AFAIK. I don't ssh in every day.

The fact that it happened somewhat simultaneously to 4 TurnKey v16.x containers certainly does suggest some common cause. I'm assuming that no non-TurnKey containers were affected, but just to be clear: were there any other containers running on the same host that weren't affected (TurnKey and/or otherwise)? If so, what were they?

I guess the other explanation is it was caused by the containers being suspended during backup? But not sure why that would kill sshd and nothing else.

Yeah, I agree, I wouldn't have expected that to be the cause. Although I guess that'd be fairly easy to test?!

So do you think I shouldn't need /usr/lib/tmpfiles.d/sshd.conf ?

Short answer: I don't know, but I don't think so.

Long answer: This is what we know for sure:

  • SSH stopped simultaneously on 4 TKL v16.1 CTs (at least within a few days of each other)
  • After adding a file: /usr/lib/tmpfiles.d/sshd.conf (to 3 of the 4 - as per your notes elsewhere) and restarting SSH on all 4 servers, it appears to be working fine again
  • SSH has continued to run fine since then

So we still don't actually have any idea what caused the issue. Thus all we can say about your changes is that they haven't made things worse. I suspect that they aren't actually doing anything, but you'd need to test to be sure. The fact that you didn't implement the change on one of the 4, and that one is also still fine, lends weight to the idea that your change makes no difference.
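For what it's worth, you didn't post the contents of that sshd.conf, but I'm guessing it's the usual one-liner that recreates sshd's privilege separation directory at boot - something like this (an assumption on my part, so please correct me if yours is different):

    # /usr/lib/tmpfiles.d/sshd.conf - recreate sshd's privsep directory
    d /run/sshd 0755 root root

If so, an easy way to test whether it's actually doing anything (on a non-critical container) would be to apply it by hand and check the directory comes back:

    # note: sshd won't accept new connections while /run/sshd is missing
    rm -rf /run/sshd
    systemd-tmpfiles --create /usr/lib/tmpfiles.d/sshd.conf
    ls -ld /run/sshd

IIRC the Debian ssh.service unit also sets RuntimeDirectory=sshd, which should (re)create /run/sshd whenever the service starts anyway - which is part of why I suspect the tmpfiles.d file is a no-op.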

Maybe a restart of sshd would have fixed it, but I just Googled the problem and went with the first solution that looked reasonable.

As I noted above, unless you diagnosed the issue or at least tried restarting ssh, then we have no idea whether your change made any difference or not. (But as I explained, probably not).
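If it does happen again, it'd be worth grabbing a bit of info before restarting anything, e.g.:

    # is the daemon dead, or running but not accepting connections?
    systemctl status ssh.service
    ss -tlnp | grep ':22'
    # any clues in the log from around the time it stopped?
    journalctl -u ssh.service --no-pager | tail -50

That should at least tell us whether sshd actually died, or was still running but wedged somehow.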

I have one container without the file so will see if the problem recurs, but I have made the KillMode=process fix to that so that might change behaviour?

Initially I was like, "what 'KillMode=process fix' are you talking about?!?" 'KillMode=process' is in the default ssh.service file. It's in your output and it looks exactly like mine?! I even checked within Debian and it looks like it was added to ssh.service with the initial systemd support - 8 years ago.

Then I remembered your other thread. I missed that you were editing the template file (i.e. ssh@.service), not the default service file (ssh.service), and because I didn't double check the ssh.service file itself, I didn't realise that at the time.

By default, SSH uses a single long-running service to manage all SSH connections and traffic. The service file it uses is 'ssh.service'.

It can instead be configured to use multiple SSH services, one for each connection. I'm not particularly familiar with that configuration, but AFAIK it uses a socket, and each SSH session is triggered on connection. Scenarios like that (i.e. multiple instances of a service) use a service template file, in this case an instance of 'ssh@.service'. You can tell it's a template because of the '@' symbol.

I'll post on your other thread too, but unless you've made some SSH config changes (which your output here suggests you haven't), what you've noted on your other thread would not make any difference (the ssh@.service file is never used by default).
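If you ever want to confirm which mode a particular server is using, something like this should make it obvious (assuming the stock Debian units):

    # 'enabled' ssh.service plus 'disabled' ssh.socket = the default long-running daemon
    systemctl is-enabled ssh.service ssh.socket
    # with socket activation you'd instead see per-connection ssh@... instances here
    systemctl list-units 'ssh*'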

I have one container without the file so will see if the problem recurs

Circling back to this; ah ok. So on at least one server, you just restarted SSH and that appeared to be sufficient?

Config requested: [...]

That all looks fine. Your service matches mine (as per the Debian default) and the other output suggests that you are running the same single long running SSH service as per default.

So in summary, we still don't know why SSH stopped, but it seems likely that simply restarting it was sufficient to get it running again.

Unfortunately, I didn't have any long running v16.x containers (I have v16.x VMs, but not CTs). As I've noted, I haven't been able to recreate any of the issues you've reported.

I have had a look at SSH updates, and there haven't been any since the app was built (so it hasn't been updated) - so it wasn't an SSH update that caused any of your issues. There have been updates to systemd since build, but they were some time ago too, and the most recent was a security update, so it should have been installed within 24 hours of your initial launch (either at firstboot or at the first cron-apt run) - unless of course you changed that before cron-apt ran?
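(If you want to double check the same thing on your end, the dpkg log records exactly what was upgraded and when, e.g.:)

    # when (if ever) openssh or systemd packages were upgraded on a given CT
    zgrep -h " upgrade " /var/log/dpkg.log* | grep -E 'openssh|systemd'
    # plus what's installed vs what's currently available
    apt-cache policy openssh-server systemd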

So unfortunately, why SSH stopped on these 4 servers is still a mystery... I personally really hate problems like that. I sort of got used to it with Windows, but find it much rarer on Linux. Although if a restart appears to have fixed it and it doesn't stop again anytime soon, then perhaps the "why" doesn't really matter that much?!


My point of view and rationale is: if it ain't broke don't fix it :-)

I'm inclined to agree. Although it could also fairly be argued that if the Debian security team have released an update, then something is broken! :) They don't do that lightly...

As far as I'm concerned, security vulnerabilities fall into two categories: those the maintainers are notified of before public disclosure, and those notified after. If maintainers are notified before disclosure, they have plenty of time to fix and release a patch before the world gets to know, so a few more days before the patch is applied makes no difference. If notified after, it will take them a while to fix it anyway, so a few more days of being vulnerable really isn't going to make much difference either.

I get your point and it's not an unreasonable rationale. Although the same argument could equally be used to argue for daily updates instead of hourly (or weekly instead of daily), so the timeframe it implies is somewhat arbitrary - it really comes down to which timeframe is the most appropriate. We settled on daily as it seems like the best balance to us. You disagree, which is fair enough, but I'm not sure your argument is convincing enough for me to change time-tested config.
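For the record, if you do want to dial the frequency back on your own servers, it should just be a matter of editing the cron-apt cron entry - I'm going from memory on the exact path and line, so please double check on your system:

    # /etc/cron.d/cron-apt (path from memory - double check)
    # default: every night at 4am
    0 4 * * *    root    test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt
    # e.g. weekly instead (4am every Monday):
    #0 4 * * 1   root    test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt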

As far as I'm concerned, regular updates are much more likely to introduce issues themselves, by interfering with applications or simply by being badly tested patches.

I 100% agree! For what it's worth, the 'stable' in Debian stable doesn't refer to how stable the software itself is (although generally it is). It refers to the behaviour of the software staying stable, i.e. not changing. It's well documented that bugs in some specific software have been left unpatched in Debian stable explicitly because fixing them would change expected behaviour!

Microsoft update has got the world into bad habits. Any change on a production server should really go through a change control procedure where it is tested first, but who has the time?

As an ex-Windows administrator, I would add the qualifier of "recent" or "modern" Microsoft update! I pity the poor fool who left auto updates running in Win XP on a Win Server 2003r2 network (I learned the hard way...). It's really only been since Windows 10 that I've found the Windows updates robust enough to auto enable (and still sleep at night).

But to your point, again I'm broadly inclined to agree. But that's why we only install the security updates (not all updates). The Debian security team carefully crafts patches that introduce the minimal possible changes to address the security issue.

Even with the security updates enabled, minor CVEs may remain unpatched. Fixes for those sometimes just go to the "updates" repo and are rolled out together at the next "point release" - which TurnKey users will never get unless they manually install available updates.
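To illustrate, on a v16.x (Buster based) server the two are separate apt repos, roughly like this (the exact mirrors on a TurnKey build may differ slightly, but the split is the same); cron-apt only ever installs from the first:

    # security repo - installed automatically by cron-apt
    deb http://security.debian.org/debian-security buster/updates main
    # point release / "updates" repo - only applied if you update manually
    deb http://deb.debian.org/debian buster-updates main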

So by delaying it, you're letting other people install it first, so any problems get found and fixed before you apply the patch. I see every update as Russian roulette. The fewer you do, the better, within reason. Plus I was getting inundated with cron-apt emails from every container.

Again, I understand your concern. I think there is some validity in changing the frequency, or even disabling the updates altogether, but in our experience the risk is low. For its entire ~14 year history, TurnKey has always auto-installed available updates from the security repo (only - never updated packages in main or updates). In the time that I've been closely involved with TurnKey (incidentally about as long as we've been based on Debian - roughly 10 years), IIRC the auto security updates have caused 2 hiccups. On both occasions the issue was more to do with the way the updates are installed than with the updates themselves: new dependencies (from main) were required, and as our updates (can) only install from "security", the updates failed. This did cause a DoS, but no significant data loss.

I do also vaguely recall a Samba security update that introduced a regression, but the Debian security team released a revised security update the next day.
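As an aside, re being inundated with cron-apt emails: if the email volume is the main annoyance (rather than the updates themselves), from memory cron-apt can be told to only mail when something actually happens, via the MAILON setting - the comments in the config file itself are the authoritative reference:

    # /etc/cron-apt/config (from memory - check the comments in the file)
    # only mail when packages were actually upgraded...
    MAILON="upgrade"
    # ...or only when something went wrong
    #MAILON="error"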

So I'm totally open to documenting how to reduce the frequency. I'd even be open to making it easier to change (e.g. via script or a confconsole plugin), but I'm not sure I'd want to change the default behaviour. Actually, I would like to make it configurable to allow installation of new dependencies from main if needed (thus eliminating the failure mode behind the 2 times it has caused issues). That config should be possible, but I haven't spent tons of time trying to work it out, and obviously it would need testing prior to implementation.
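Completely untested, but my rough thinking there is apt pinning: point cron-apt at the full sources list, then pin non-security packages low enough that they can still be pulled in as new dependencies but never upgrade anything that's already installed. A sketch only (the file name is made up, and the pin semantics would need verifying against apt_preferences(5) before relying on it):

    # hypothetical /etc/apt/preferences.d/security-only
    # security updates remain normal upgrade candidates
    Package: *
    Pin: release l=Debian-Security
    Pin-Priority: 500

    # everything else: installable as a new dependency, but (at priority <100)
    # never chosen over a version that's already installed
    Package: *
    Pin: release o=Debian
    Pin-Priority: 90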

Anyway, apologies on the essay...