Hung server every 5-6 days

We have an Azure-based VPS instance (2 vCPU, 4 GB RAM, 30 GB SSD) running an up-to-date Ubuntu 18.04, a fresh Hestia install with the PrestaShop e-commerce platform, and a single domain.

User traffic is fairly low, about 100 visits/day to the e-commerce site, and the average CPU load is about 7%.

Every 5 or 6 days, at around 15:30 (+/- 10 minutes), we are left with a hung server, and the only course of action is to power the instance off and restart it.

We have disabled all mail services, increased PHP-FPM's pm.max_children to 20 (the PHP log was reporting "server reached max_children setting (8), consider raising it"), and checked that there aren't any cron processes running at that time.
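For anyone following along, this is roughly the pool setting involved (a sketch; the path and the `dynamic` tuning values below are assumptions for a stock PHP 7.2 install on Ubuntu 18.04, not our exact config):

```ini
; /etc/php/7.2/fpm/pool.d/www.conf  (path assumed; adjust to your PHP version)
pm = dynamic
pm.max_children = 20    ; raised from the default 8 after the log warning
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6
```

Note that on a 4 GB box, 20 children is only safe if each PHP worker stays well under ~150 MB resident.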

Has anything like this happened to anyone? Any idea where to investigate the problem?

Sounds like a weird issue - I don't have any hanging servers here, and I'm not sure it is directly related to Hestia.

It looks like it hangs every time at around the same time. Is any cron job or anything else time-based running? A backup, maybe?

We haven’t found any odd services/cron processes running on the server. It’s a little bizarre, to say the least…
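In case it helps others hunting the same thing, these are the commands we used to enumerate everything time-based (a sketch, run as root; nothing here is Hestia-specific):

```shell
# List every user's crontab (crontab -l returns non-zero for users without one)
for u in $(cut -d: -f1 /etc/passwd); do
  crontab -l -u "$u" 2>/dev/null | sed "s/^/[$u] /"
done

# System-wide cron entries; not all of these locations exist on every box
cat /etc/crontab /etc/cron.d/* 2>/dev/null || true
ls /etc/cron.hourly /etc/cron.daily /etc/cron.weekly 2>/dev/null || true

# systemd timers can fire jobs entirely outside of cron
systemctl list-timers --all 2>/dev/null || true
```

Don't forget the timers: on 18.04, things like apt updates run from systemd timers rather than cron.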

How exactly does the server freeze? A complete crash, losing the network connection? Or can you still access it over a (KVM) console?

The signals we have are: web service time-outs, CPU usage around 100%, all memory free (?!), and network IN and OUT at 0 MB/s.
We think we cannot access the console because of the extreme slowness of the response.
Shutting down the VM takes 15 minutes, when 2-3 minutes is normal.
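One thing we are trying, in case the freeze happens again: a minute-by-minute snapshot of memory and top processes, so the tail of the log shows the state just before the hang. A sketch (the log path is arbitrary; run it from cron as `* * * * * /usr/local/bin/snapshot.sh`):

```shell
# Append a timestamped snapshot: free/swap memory plus the top CPU consumers.
# After a freeze, `tail -n 50 /var/tmp/freeze-snapshots.log` shows the last state.
{
  date
  grep -E 'MemFree|SwapFree' /proc/meminfo
  ps -eo pid,pcpu,pmem,comm --sort=-pcpu 2>/dev/null | head -n 10
} >> /var/tmp/freeze-snapshots.log 2>&1
```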

Try adding the venerable Munin, and look to add additional plugins, specifically multips_memory and CPU/disk activity. It could just be a runaway process, such as the infamous clamd, but it sounds to me like a buffer/cache isn’t getting trimmed, which then runs the server out of resources.
Take a look at the size of the files in /var/log/* and ensure they are truncated when they reach a sensible size, say 5 MB. A reboot may well be clearing one of these logs, making it less intensive for processes that read/write to it. Then, as it grows to GB sizes over time, the process in question gets “its knickers in a twist”.
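To enforce that size cap, a logrotate drop-in along these lines would do it (a sketch; the file name and the exact log paths are assumptions - adjust to whatever is actually growing in /var/log):

```ini
# /etc/logrotate.d/size-cap  (hypothetical drop-in)
/var/log/nginx/*.log /var/log/php7.2-fpm.log {
    size 5M
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
```

With `size 5M`, rotation triggers on size rather than on the daily/weekly schedule, which is what you want for a log that balloons between cron runs.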


Seems like an out-of-memory issue that triggers excessive swapping.

Like AlwaysSkint said, you need to have a monitoring system set up and watch for any out-of-the-ordinary activity before the server freezes (including the running processes).


Azure Linux images do not have swap configured… The stated reason is that “the user should decide on the size and location of the swap and set it up post-provisioning” … and that despite a temporary disk being attached for exactly this purpose.
Anyway, I have just configured 4 GB of swap, and in a few days I will know whether this was the reason for the apparent lack of resources.
Thanks for your contributions, they have helped us!
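For the record, the setup was roughly the following (a sketch; /mnt is where the Azure temporary disk is usually mounted - note a swap file there will not survive a redeploy, and everything must run as root):

```shell
# Create and enable a 4 GB swap file on the Azure temporary disk.
SWAPFILE=/mnt/swapfile   # assumed location; lost on VM redeploy

if fallocate -l 4G "$SWAPFILE" 2>/dev/null; then
  chmod 600 "$SWAPFILE"
  mkswap "$SWAPFILE"
  swapon "$SWAPFILE" || echo "swapon failed (needs a real host, not a container)"
  swapon --show        # verify the new swap is active
else
  echo "could not allocate $SWAPFILE (need root and a mounted temp disk)"
fi
```

Alternatively, the Azure agent can manage this itself via `ResourceDisk.EnableSwap=y` (and a `ResourceDisk.SwapSizeMB` value) in /etc/waagent.conf, which recreates the swap after a redeploy.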

Excessive. If you’re swapping out even 2 GB, then something is very amiss, and your system will likely crawl due to the I/O activity.


Still better than getting stuck because of OOM and processes getting killed :wink:
It could at least help to get a better look at what’s going on.

Because Azure, @Delmela. This isn’t a spot instance, though, that can randomly get deprovisioned?


Did you turn on the backup function? :wink: