Richard - Tue, 2022/04/12 - 08:30
Just an FYI, I'm getting an issue on my TKL containers where sshd stops running; version below. It seems a recent update caused it.
Workaround 1 to fix it from here:
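In short (paraphrasing the linked steps, so the exact details may differ), it means adding a tmpfiles.d entry so /run/sshd gets recreated at boot, e.g.:
echo 'd /run/sshd 0755 root root' > /usr/lib/tmpfiles.d/sshd.conf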
Followed by:
service sshd restart
On a related note, can you please change the update interval to weekly (or even monthly) rather than daily in /etc/cron.d/cron-apt? I changed mine to Sunday:
54 3 * * SUN root test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:        10
Codename:       buster
TKL version
TurnKey GNU/Linux 16.1 (Debian 10/Buster)
Hmm, that's strange?!
That certainly doesn't sound good! I assume that you used the Proxmox pct tool to enter the container and rescue it?! Regardless, thanks tons for posting. I'm sure it'll be helpful for others.
FWIW I just tried to recreate the issue with our v16.1 WordPress LXC container (running on Proxmox). I installed the security updates and it's still working (my SSH connection did glitch for a moment, but came good fairly quickly so I assume it was just because SSH was restarted). I then also did all available updates just in case, and that made no difference either. I have tested rebooting and SSH consistently comes back up ok?! So there must be some specific difference between our setups? I'm still running an older version of Proxmox, so perhaps that's the difference?
Regardless, as noted in your link, /var/run should be a symlink to /run. The sshd directory should exist in /run (on my system it's an empty directory, but it's there). So on a properly configured system, /var/run/sshd should definitely exist and point to the same place.
You didn't mention whether it just happens intermittently, or only after a reboot. I don't understand how it might randomly disappear, so I'll assume it's only after a reboot. I could imagine this issue being caused by some sort of race condition on boot. I.e. under some circumstance, perhaps the /run directory has not (yet) been created (it's tmpfs created on the fly at boot) by the time that SSH tries to start (so /var/run/sshd doesn't yet exist)? That would explain why the workaround works. The workaround explicitly makes /run/sshd the path for SSH to use. Systemd is probably smart enough to not try starting ssh until /run has been set up. I'm only guessing though...
However, a closer look at the SSH service file (/usr/lib/systemd/system/ssh.service) shows that the directory ssh should use is already set to /run/sshd by this line:
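On my system that looks like the following (worth double checking your own copy in case it differs):
RuntimeDirectory=sshd
RuntimeDirectoryMode=0755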
The base directory for 'RuntimeDirectory' is /run, so systemd should be creating the required /run/sshd directory when it starts SSH?!
That makes me wonder if you have done some sort of system upgrade/migration steps at some point? If you did a TKLBAM data migration, perhaps something from the old server has inadvertently been included when it shouldn't have? Or perhaps you did an "in place" Debian upgrade? Please share any other details. Also, I'd be interested to see what your ssh.service file includes and if there are any overrides configured. So please share the output of these commands:
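For example, something along these lines (illustrative only - any equivalent that shows the effective unit config and current state is fine):
systemctl cat ssh.service
systemctl status ssh.service --no-pager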
Regardless though, if the workaround you noted resolves your problem, then that's a good thing (the workaround seems pretty reasonable - albeit should be unneeded with default config). My only concern is that your wording suggests that perhaps the issue was intermittent? If that was the case, then perhaps something else is going on and the workaround you applied hasn't actually done anything (and the issue just hasn't occurred again since - coincidentally)?!
As to your other point, re cron job adjustment. Your change looks ok in essence, but personally I think that for a production server, installing security updates nightly is the desirable and preferable way to go. So unless I hear a really convincing argument for why it should run less often, I won't be changing the default in TurnKey servers. I might be prompted to include a confconsole plugin to change the frequency, but it isn't currently a priority (you are one of a handful of users to make some sort of complaint about the auto updates in 10 years, so it seems that it suits most users).
Why do you want it to run less often? Without understanding your rationale for wanting to reduce the security updates install window, it's hard to make alternative, superior suggestions. If it's related to network traffic, then setting up apt repo caching (either via apt-cacher-ng, or squid) might be a better way to resolve network traffic concerns.
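E.g. with apt-cacher-ng running somewhere on your network, pointing each container at it is a one-liner (the hostname here is a placeholder; 3142 is the apt-cacher-ng default port):
echo 'Acquire::http::Proxy "http://apt-cacher-host:3142";' > /etc/apt/apt.conf.d/01proxy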
More info
Hi Jeremy,
The background is: fairly vanilla 16.1 containers running on Proxmox VE 7.1-4. I haven't messed with sshd. I have installed other stuff like crowdsec, a web app framework, and redis. Deployed from scratch.
It happened to 4 containers simultaneously AFAIK, one is nginx & mariadb, the others are TKL core. Yes, proxmox local console to the rescue. It has only happened once so far AFAIK. I don't ssh in every day.
I guess the other explanation is it was caused by the containers being suspended during backup? But not sure why that would kill sshd and nothing else.
So do you think I shouldn't need /usr/lib/tmpfiles.d/sshd.conf?
Maybe a restart of sshd would have fixed it, but I just Googled the problem and went with the first solution that looked reasonable. I have one container without the file so will see if the problem recurs, but I have made the KillMode=process fix to that so that might change behaviour?
Config requested:
Security updates:
My point of view and rationale is: if it ain't broke don't fix it :-) As far as I'm concerned security vulnerabilities fall into two categories: notified before release and after release.
If maintainers are notified before release then they have plenty of time to fix and release a patch before the world gets to know, so a few more days before the patch is applied makes no difference.
If notified after it will take them a while to fix it anyway, so a few more days of being vulnerable really isn't going to make much difference either.
As far as I'm concerned, doing regular updates is much more likely to introduce issues, either by interfering with applications or simply by being a badly tested patch. Microsoft update has got the world into bad habits. Any change on a production server should really go through a change control procedure where it is tested first, but who has the time? So by delaying updates, you're letting other people apply them first and hit (and fix) any problems before you do. I see every update as Russian roulette: the fewer you do, the better, within reason. Plus I was getting inundated with cron-apt emails from every container.
Strange issue!?
All sounds good so far ... (BTW crowdsec looks awesome - any more to say about your experiences with that?)
The fact that it happened somewhat simultaneously to 4 TurnKey v16.x containers certainly does suggest some common cause. I'm assuming that you didn't have any non-TurnKey containers affected, but just to be clear, did you have any other containers running on the same host that weren't affected (TurnKey and/or otherwise)? If so, what were they?
Yeah, I agree, I wouldn't have expected that to be the cause. Although I guess that'd be fairly easy to test?!
Short answer: I don't know, but I don't think so.
Long answer: This is what we know for sure:
So we still don't actually have any idea what caused the issue. Thus all we can say about your changes is that they haven't made things worse. I suspect that they aren't actually doing anything, but you'd need to test to be sure. The fact that you didn't implement the change on one of the 4, and that one is also still fine, lends weight to the idea that your change makes no difference.
As I noted above, unless you diagnosed the issue or at least tried restarting ssh, then we have no idea whether your change made any difference or not. (But as I explained, probably not).
Initially I was like, "what 'KillMode=process fix' are you talking about?!?" 'KillMode=process' is in the default ssh.service file. It's in your output and it looks exactly like mine?! I even checked within Debian and it looks like it was added to ssh.service with the initial systemd support - 8 years ago.
Then I remembered your other thread. I missed that you were editing the template file (i.e. ssh@.service) not the default service file (ssh.service) and I didn't actually double check the ssh.service file and realise.
By default, ssh uses a single long running service to manage all ssh connections and traffic. The service file it uses is 'ssh.service'.
I'll post on your other thread too, but unless you've made some SSH config changes (which your output here suggests you haven't), what you've noted on your other thread would not make any difference (the ssh@.service file is never used by default).
SSH can instead be configured to use multiple ssh services, one for each connection. I'm not particularly familiar with that configuration, but AFAIK, it uses a socket and each ssh session is triggered on connection. Scenarios like that (i.e. multiple instances of a service) use a service template file, in this case an instance of 'ssh@.service'. You can tell it's a template because of the '@' symbol.
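For what it's worth (I haven't tested this on TurnKey, so treat it as a rough sketch), switching a Debian system over to the socket-activated mode would look something like:
systemctl disable --now ssh.service
systemctl enable --now ssh.socket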
Circling back to this; ah ok. So on at least one server, you just restarted SSH and that appeared to be sufficient?
That all looks fine. Your service matches mine (as per the Debian default) and the other output suggests that you are running the same single long running SSH service as per default.
So in summary, we still don't know why SSH stopped, but it seems likely that simply restarting it was sufficient to get it running again.
Unfortunately, I didn't have any long running v16.x containers (I have v16.x VMs, but not CTs). As I've noted, I haven't been able to recreate any of the issues you've reported.
I have had a look at SSH updates, and there haven't been any since the appliance was built (so it hasn't been updated). So it wasn't an SSH update that caused any of your issues. There have been updates to systemd since build, but they were some time ago too, and the most recent is a security update, so it should have been installed within 24 hours of your initial launch (either at firstboot or the first cron-apt run within 24 hours) - unless of course you changed that before cron-apt ran?
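If you want to double check that on your end, something like this will show the installed versus available versions:
apt policy openssh-server systemd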
So unfortunately, why SSH stopped on these 4 servers is still a mystery... I personally really hate problems like that. I sort of got used to it with Windows, but find it much rarer on Linux. Although if a restart appears to have fixed it and it doesn't stop again anytime soon, then perhaps the "why" doesn't really matter that much?!
I'm inclined to agree. Although it could also fairly be argued that if the Debian security team have released an update, then something is broken! :) They don't do that lightly...
I get your point and it's not an unreasonable rationale. Although your argument could be used to argue for daily updates instead of hourly, so the timeframe is somewhat arbitrary. So it comes down to which timeframe is the most appropriate. We settled on daily as it seems like the best balance to us. You disagree, which is fair enough. But I'm not sure that your argument is convincing enough for me to change time tested config.
I 100% agree! For what it's worth, the 'stable' in Debian stable doesn't refer to how stable the software itself is (although generally it is). It actually refers to the stability of the software's behaviour. It's well documented that bugs in some specific software have been left unpatched in Debian stable precisely because fixing them would change expected behaviour!
As an ex-Windows administrator, I would add the qualifier of "recent" or "modern" Microsoft update! I pity the poor fool who left auto updates running in Win XP on a Win Server 2003r2 network (I learned the hard way...). It's really only been since Windows 10 that I've found the Windows updates robust enough to auto enable (and still sleep at night).
But to your point, again I'm broadly inclined to agree. But that's why we only install the security updates (not all updates). The Debian security team carefully crafts patches that introduce the minimal possible changes to address the security issue.
Even with the security updates enabled, minor CVEs may remain unpatched. Those sometimes just go to the "updates" repo, so are rolled out together at the next "point release" (neither of which TurnKey users will ever get unless they manually install available updates).
Again I understand your concern. I think there is some validity in changing the frequency, or even disabling the updates altogether. But in our experience, the risk is low. For its entire ~14 year history, TurnKey has always installed available updates from the security repo (only - never updated packages in main or updates). In the time that I've been closely involved with TurnKey (incidentally about the same time we've been based on Debian - roughly 10 years), IIRC the auto security updates have caused 2 hiccups. On both occasions the issue was more to do with the way the updates are installed than with the updates themselves. Both times new dependencies (from main) were required and, as our updates (can) only install from "security", the updates failed. This did cause DoS, but no significant data loss.
I do also vaguely recall a Samba security update that introduced a regression, but the Debian security team released a revised update the next day.
So I'm totally open to documenting how to reduce the frequency. I'd even be open to making it easier to change (e.g. via a script or a confconsole plugin), but I'm not sure I'd want to change the default behaviour. Actually, I would like to make it so that it could be configured to allow installation of new dependencies from main if needed (thus eliminating the risk of the 2 times it has caused issues). That config should be possible, but I haven't spent tons of time trying to work it out. And obviously it would need testing prior to implementation.
Anyway, apologies on the essay...
Answers
Just to answer your questions:
Crowdsec: I have run fail2ban and liked it but just thought I'd try crowdsec for this server. I haven't really monitored or played with it much (I really should) but it just seemed like a good idea to have detections from the community replicated. fail2ban was blocking so many IPs from ssh attempts alone I figured it was a good idea to share that info with the community. Though there is a lot of trust to put into something with not a great deal of history from what I gather. I have it installed as a "no-api" on my containers, then have the "lapi" and "firewall-bouncer" installed on the proxmox pve. The container services are registered with the pve instance. So in theory any strange internal behaviour should cause a block on the public interface. Though if they're already inside a container, blocking an internal IP isn't going to do much good for inter-container traffic. I do have ports forwarded through to containers though.
I'm not running any other containers, just 4x TKL.
Backups run nightly and I haven't seen any other issues.
Just to recap the timeline:
It seems like sshd was killed for some reason at some point on 3 containers; a service restart probably would have fixed it.
A daemon log snippet from a random time before I noticed the issue on one container. No idea if it's related:
Then I restored a backup of one container and modified it, and on that one, for some reason, the socket took over from the service. Since then everything seems to have been stable.
Thanks for the essay anyway!
Thanks for sharing
Crowdsec certainly looks pretty cool. I'll keep a bit of an eye on it...
Also, that log output you shared is using the socket (i.e. the first line notes an instance of the template: 'ssh@13-10.0.254.1:22-10.0.0.1:44938.service').
So it's still a crazy mystery...
An update...
Just as a quick update.
I ran into the `Missing privilege separation directory: /var/run/sshd` problem again on a vanilla Debian 11 container and it was caused by sshd crashing/stopping. Still don't know why but it was resolved with a `service sshd restart`.
It seems when the service stops/crashes the socket takes over because I was still able to log in before the restart.
It was noticed because mail-in-a-box runs a status check every night, which runs `sshd -T`, and that failed with the above error.
Oddly, I had to close the ssh socket connection (log out) and restart sshd from a local console, otherwise I couldn't log in even though the service said it was up.
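For anyone hitting the same thing, a quick way to see which of the two is actually active at any point is something like:
systemctl status ssh.service ssh.socket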
More crowdsec info:
If you're using it in containers on a parent server and forwarding ports, make sure you enable the FORWARD rule in crowdsec-firewall-bouncer.yaml, otherwise bouncer blocks won't affect your forwarded traffic.
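From memory the relevant setting is the list of iptables chains the bouncer adds its rules to, something along these lines (check the config shipped with your version for the exact key name):
iptables_chains:
  - INPUT
  - FORWARD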
It's under active development so make sure you install the latest version.
Once you've learnt the cli tool it's pretty good; I prefer it over fail2ban now. If you like to use recidive to permanently block IPs, crowdsec has the advantage that you get a huge list of blacklisted community IPs out of the box.
The only issue I found was when an ssh rule breach occurred, there appeared to be an internal loop trying to process it which made 300,000 attempts. I raised it as an issue and nothing was done except a 'try the latest version' response even though the loop hadn't changed. I haven't checked back to see if the upgrade made a difference.