2019-08-13

[Status] D'oh - We've been running without OPcache all this time

What a day. Hopefully no more of these for a while.

So. PHP has a feature where it saves the code for the pages it runs (PHP scripts are basically programs that get run on each page view) after it’s digested the PHP into a form it can run faster, because that skips some of the repetitive prep work. Savvy readers know this form as “bytecode”.

This feature is called OPcache, and it’s enabled by default in most PHP installations. But here’s the thing - there’s also an older version of the same idea, called APC. I’m pretty sure it’s mutually incompatible with OPcache, and that’s the problem: CentOS, because its packages are fairly modular, includes OPcache as an optional install, not a default one.

As it turns out, this means that every page load was slamming the disk to re-read the PHP script (somewhat expensive), turning it into bytecode (very expensive), then running it (so-so), when ideally, it would only need to do the last of those steps. Any measure to reduce the number of times a page has to be produced (by using a cache) was, therefore, not really addressing the root issue - burning too much CPU per page load - and the server was buckling under the load.
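
To make those three steps concrete, here is a minimal sketch of the same idea in Python rather than PHP. It is not how OPcache is implemented; `BytecodeCache`, the `render` function, and the mtime-based invalidation are all assumptions made up for illustration. The only point is that a cache hit skips the read and the compile and goes straight to running.

```python
import os


class BytecodeCache:
    """Hypothetical cache: script path -> (mtime, compiled code object)."""

    def __init__(self):
        self._entries = {}
        self.compiles = 0  # how many times the expensive step actually ran

    def load(self, path):
        """Return compiled code for path, recompiling only if the file changed."""
        mtime = os.stat(path).st_mtime_ns
        entry = self._entries.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]  # cache hit: no read, no compile
        with open(path, encoding="utf-8") as f:
            source = f.read()  # step 1: read the script from disk
        code = compile(source, path, "exec")  # step 2: turn it into bytecode
        self.compiles += 1
        self._entries[path] = (mtime, code)
        return code

    def run(self, path):
        """Step 3: execute the (possibly cached) bytecode for one page view."""
        scope = {}
        exec(self.load(path), scope)
        # Assumption for this sketch: every script defines render().
        return scope["render"]()
```

Calling `run` on the same unchanged file a hundred times compiles it once; without the cache, every call would pay for all three steps.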

Since that change, the server has seen a dramatic dropoff in CPU usage - in some cases 20x less, though under load, more like 5x less CPU consumption. I really, /really/ hope this one sticks!

2019-08-13

[Status] Another configuration tweak

Another hiccup this time - a bunch of intermittent 500 errors on the site.

Basically, the problem was that php-fpm was configured to let its processes run indefinitely until they hit some large maximum number of requests, and also to keep a very large number of worker processes alive to serve them.

So, when demand spiked and a bunch of pages started getting loaded, php-fpm tried to catch up to demand and launch processes to keep up. Problem is, each page load (request, technically) has a timeout (due date), and if it’s late the server goes “oh well, that’s an error” and gives up - otherwise it could be exploited by a malicious client making it keep trying to serve the page forever.

So what happened is that php-fpm wanted to catch up, and launched itself so many times that each individual one had trouble making that due date - hence the site being laggy or down.
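
A rough back-of-the-envelope model of why more processes made things worse, not better. This is a sketch under simplifying assumptions (requests are purely CPU-bound and the busy workers share the cores evenly); the function names and every number in the usage note are hypothetical, not measurements from this server.

```python
def per_request_latency(cpu_seconds, workers, cores):
    """Wall-clock seconds for one request when `workers` busy processes share `cores`.

    Assumes even CPU sharing: once there are more busy workers than cores,
    each request only gets a cores/workers slice of a core.
    """
    slowdown = max(1.0, workers / cores)
    return cpu_seconds * slowdown


def summarize(cpu_seconds, workers, cores, timeout):
    """Report the modelled latency and whether it blows through the timeout."""
    latency = per_request_latency(cpu_seconds, workers, cores)
    return {"latency": latency, "timed_out": latency > timeout}
```

With made-up numbers of 0.5 CPU-seconds per page, 2 cores, and a 30-second timeout, `summarize(0.5, 20, 2, 30)` gives a 5-second latency, while `summarize(0.5, 200, 2, 30)` gives 50 seconds and every request is late.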

Fixing it was a matter of keeping fewer processes around, and forcing them to relaunch after a fixed number of requests so that garbage piling up in their memory would get thrown out every so often.

For good measure, I lowered MariaDB’s connection limit since it’s firewalled off from everything but WordPress, and shouldn’t need that many (especially as it should be connecting over a socket).

2019-08-11

[Status] Parahumans.net (Ward) WordPress move

This is the second of two updates, concerning the recent WordPress move that most people probably have not noticed.

What people probably have noticed is that the WordPress server stability has been less than great recently. The reason for this is, essentially, unseen screwups in the original setup that I wasn’t looking out for because of AWS credits that finally expired.

Basically, what used to cost near $0 suddenly became $150 per month halfway through. The reasons for this are:

1. Overprovisioned EC2 server, just in case.
2. Separate RDS server, in a different Availability Zone
   - I used RDS because I didn’t want to have to handle backups, but in practice, the UpdraftPlus extension did its job admirably, making it irrelevant.
3. Orphaned EBS volumes built up over time when I last refactored the servers

I adjusted #1 down, which mostly worked most of the time. When it didn’t, EC2 choked so badly it couldn’t even force-stop the instance without timing out, delaying any recovery.

I tried to change #2 by creating a multi-AZ mirror and failover, but it still ended up in a different AZ from the EC2 instance and the entire server was down while I was trying this.

I was able to fix #3 without much incident by just carefully auditing the servers. I encountered a bizarre bug where two different EBS disks had the same UUID, so I had to learn how to override the duplicate-UUID check in order to inspect the disks’ contents and back them up as necessary.

Throughout all this, the server logging was also not up to snuff. The WordPress Docker image was pretty simple to get running, but running a server in a Docker network in a VPC made configuring the proxy protocol too much of a headache, so the server was never really accounting for IP addresses properly. After looking at the TCO, I figured that switching to a different VPS was going to be worthwhile - better $150/year than optimizing down to, at best, a multiple of that while wrangling with AWS services. It’s less elastic, but we don’t really need elasticity for this.

So, I set MariaDB, nginx, PHP, letsencrypt, and WordPress up (bonus points: I got to migrate off of a grossly out-of-date traefik and a letsencrypt cron docker image which didn’t run periodically as it advertised), and went to town on moving it over. The result is a mostly transparent move, though there may be some hiccups (tag me [Yewnyx] in the Parahumans Discord to let me know).

Also, I discovered that the tiered caching I put in place (first CloudFlare, then the WP extensions) was saving substantial CPU. I discovered this in part because now that the logs aren’t all messed up, I was able to notice a scraper scraping the site every couple of seconds. Wildbow writes fast, but not THAT fast. Whoever is doing that, kindly slow the fuck down, please!
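
For the curious, spotting a scraper like that does not take much once the logs carry real client IPs. Below is a minimal sketch, not the actual tooling used here: it assumes the access log has already been parsed into `(client_ip, unix_timestamp)` pairs, and the function name, thresholds, and example addresses are all hypothetical.

```python
from collections import defaultdict


def find_fast_clients(requests, max_interval=5.0, min_requests=10):
    """Flag clients whose average gap between requests is suspiciously small.

    requests: iterable of (client_ip, unix_timestamp) pairs.
    Returns {client_ip: mean seconds between requests} for flagged clients.
    """
    by_client = defaultdict(list)
    for ip, ts in requests:
        by_client[ip].append(ts)

    flagged = {}
    for ip, stamps in by_client.items():
        # Need at least two requests to measure a gap at all.
        if len(stamps) < max(2, min_requests):
            continue
        stamps.sort()
        mean_gap = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
        if mean_gap <= max_interval:
            flagged[ip] = mean_gap
    return flagged
```

A client hitting the site every two seconds gets flagged; a reader who loads a page every ten minutes does not, however many pages they read overall.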

Anyways, fingers crossed things should be more stable now - and a great deal cheaper for me, too.

P.S. Shoutout to etckeeper for helping me feel more confident in recording the setup process so I can repeat it later. Orchestration and containerization are cool but sometimes you just want to set up a server.

2019-08-11

[Status] IRC to Discord Move

Been a while since the last update, and many things have happened. This is the first of two updates, which are split because the IRC move is complete at this point, and old news to most prior participants in the IRC server or unofficial Discord communities.

IRC to Discord Move

Why not IRC

Firstly, the IRC server is finally down. There has been some discussion on why this happened, so I’d like to clear the air on this:

The #1 top reason was that I spent far too much time stressing over how to keep it online in the least disruptive way possible.

Reason #2 was I wanted to make sure spam didn’t overrun the server, but IRC doesn’t give you many good options except subscribing to DNS blacklists. That would be fine…if the blacklist configuration didn’t regularly kick off IRCCloud.

Even stranger, even updating the config file and ensuring the blacklist module was loaded into the server wouldn’t work: it would mysteriously revert to an old configuration that lacked up-to-date server exemptions. I have my suspicions that this was a gnarly issue with loading the C++ module improperly but ultimately this wasn’t fixable without restarting the server - potentially multiple times.

Reason #3 was that it was hard to properly ban users, because Anope and InspIRCd had different syntaxes that on more than one occasion I mixed up and ended up banning half the server. Making this worse, there was a contingent of problematic users who regularly made adminning the server harder than it needed to be, which raised the pressure to correctly issue the proper commands, which nobody felt qualified to do but me, as the number of technical staff present dwindled.

Why Discord

It’s basically IRC, but with a lot of the rougher edges sanded off. A move was considered much earlier, but at the time there was less overall confidence that Discord would have longevity. With IRC becoming too much of a burden to maintain, and with greater confidence in Discord’s longevity, factors shifted in its favor, though not without major reservations about losing bot functionality, moving the community and community norms, inclusivity, and differences in channel moderation practices.

To address the loss of bot functionality, the migration was done slowly, with the full changeover waiting until the majority of bot functionality had been reimplemented. To establish community norms, we minimized the number of roles and role-gated permissions. Not having voice chat was primarily about staying inclusive of the hearing-impaired, but having been recently burned by people abusing a private channel on IRC, we weren’t keen on introducing an unauditable complication to moderation. As for channel moderation practices, it’s been made much easier with the fairly standard Discord bots to help.

2019-04-13

RFC: IRC Server

Hey. Yewnyx here. I admin the Parahumans servers, meaning the Ward WordPress as well as the Parahumans IRC server. This note is about the Parahumans IRC server; Ward has remained more or less smooth since it was first set up.

So. As some of you have noticed (particularly WeaverDice players), there have been issues with connecting to IRCCloud.

The reason for this is that there is a session limit per IP connecting to the server. This helps us avoid spammers. Obviously, IRCCloud should be exempted from this session limit: they a) identify their users uniquely to us (so moderation actions can stick) and have their own antispam measures, and b) are a popular way to stay connected remotely for a great many users.
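
The mechanism is simple enough to sketch. This is an illustration of a per-IP session limit with exemptions in general, not Anope’s actual implementation or configuration; `SessionLimiter`, the limit of 3, and the example addresses are assumptions.

```python
from collections import Counter


class SessionLimiter:
    """Hypothetical per-IP session limit with an exemption list."""

    def __init__(self, limit, exempt=()):
        self.limit = limit
        self.exempt = set(exempt)  # IPs allowed to exceed the limit
        self.sessions = Counter()  # current open sessions per IP

    def connect(self, ip):
        """Return True if the new session is accepted, False if refused."""
        if ip not in self.exempt and self.sessions[ip] >= self.limit:
            return False
        self.sessions[ip] += 1
        return True

    def disconnect(self, ip):
        """Record that one session from this IP has closed."""
        if self.sessions[ip] > 0:
            self.sessions[ip] -= 1
```

A shared gateway funnels many users through one address, so without an exemption it hits the limit almost immediately. If the exemption list is ever lost, the very next connection from that gateway is refused, even though nothing about those users changed.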

The problem we’re having is that for whatever reason, Anope, which provides IRC Services such as OperServ, ChanServ, and NickServ, is on the fritz and I don’t know why. It forgets the exemptions every so often for no reason I can discern. I add the exemptions, they work, and a couple of hours later, poof. Anope notices, and kills the extra sessions as it thinks it is supposed to do.

Maybe restarting Anope can fix this temporarily (and whatever corrupt state its databases might be in, we deal with and scramble to fix afterwards).

However, at this point, I’m a bit burnt out. I’ve been walking on eggshells for a couple of years. Not in terms of what I say but in terms of being very, very conservative in what changes I make to the IRC server. The most recent troubles hit me right as I was sick in Tokyo, then flew back to the west coast, and was sick and jet lagged there. It caused a lot of stress, and it’s an experience I’m not keen on repeating.

I’m not going to be happy if I continue to maintain the IRC server as such. It’s been pretty stable: we can count the number of times we’ve put you all through a server reboot on one hand. But going forward, we have 3 options:

1. Train up some current trusted people in how to run all aspects of an IRC server,
2. Put out feelers for someone outside the server to take it on (and build up trust), or
3. Migrate to a system which is more stable and easy to admin (heavily implies Discord).

Personally, I heavily favor #3, as Discord:

- Is easier to write bots for
  - We could script WeaverDice channel creation
- Makes moderation less janky and messy
- Allows channels to be bundled together in categories
- Makes channels more discoverable
- Integrates well with Patreon (however this is pretty much off the table where a Parahumans discord is concerned)

And additionally, any transition to a different IRC setup is going to be messy and difficult, no matter how smooth we can make it for you to switch over.

Migrating to Discord isn’t without risk, however. As intimidating as IRC is for some people, and as accessible as Discord is, from a certain perspective that is a blessing in disguise. The IRC is more often than not a pleasant, intimate area where like-minded fans can chat together, and where many names are mutually recognizable. With increased accessibility, the community may expand much faster than our ability to moderate it or keep track of discussion; an “Eternal September” scenario is not difficult to imagine.

I’m interested in people’s feedback on this - I’m about at the end of my rope on supporting the IRC server, but the last thing I want to do is leave anyone in the lurch.

–Yewnyx

Addendum

It’s been pointed out to me that I ought to explicitly include some relevant motivating context for this. As I pointed out, I was involved in some rather difficult server administration last week, and in the midst of addressing the IRCCloud connection difficulties, also had to reconfigure the server to remove certain permissions and issue moderator actions. Previously, I’ve accidentally kick-banned many tens of blameless users due to messing up the different forms of user/host specifications in InspIRCd and Anope; so I was extra-careful. This meant that while sick and jetlagged, I was having to be very cautious and careful. I ended up passing out for several hours in between starting and completion of putting these measures in place.
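
To show how easily a wildcard mask goes wrong, here is a small sketch of a dry-run check one could do before issuing a ban. It uses the generic `nick!user@host` wildcard form and deliberately does not reproduce either InspIRCd’s or Anope’s actual syntax; the function names and the example users are hypothetical, and plain case-insensitive matching is a simplification of real IRC case mapping.

```python
import re


def mask_to_regex(mask):
    """Translate an IRC-style wildcard mask (* and ?) into a compiled regex."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")  # any run of characters
        elif ch == "?":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE)


def affected_users(mask, users):
    """Dry run: list which connected users a ban on `mask` would hit."""
    pattern = mask_to_regex(mask)
    return [u for u in users if pattern.fullmatch(u)]
```

Against a user list of `alice!alice@host1.example.net`, `bob!bob@host2.example.net`, and `troll!x@bad.example.org`, the mask `*!*@bad.example.org` hits one user, `*!*@*.example.net` hits the two blameless ones, and `*!*@*` hits everybody. Seeing that list before the ban goes out is the whole point.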

In more principled (rather than anecdotal) terms, I believe the problematic channel itself owed its continued existence to a general lack of discoverability of channels and lack of logging: there was no review for it for a very long time. While personal responsibility (and not living up to it) was a large factor here, the system was also structured in a way that enabled it. In principle I have nothing against private channels, but there should be a reviewable standard of behavior that goes along with that privilege, in my opinion. I favor a system in which on-server conduct is by default reviewable by admins, and that is a notable difference in Discord vs. IRC.

2018-01-03

[Status] Rebooting the server

The server stopped responding for some reason. I’m rebooting and looking into what happened. Sorry for the downtime.

Update 1:

Okay. So I think that EFS is causing massive delays in WordPress presenting a page, and traefik gives up. On the off chance that it doesn’t, a page will show, which explains why some entries showed WordPress getting hit but, for the most part, erroring. This would also explain why htop shows a bunch of apache processes in uninterruptible sleep (blocking on IO), and why the server needed a hard reset to access once the issue was seen (everything blocking on NFS IO, computer blew up; a common scenario).
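
For reference, uninterruptible sleep is what htop shows as state `D`. The same check can be sketched without htop by reading each process’s state out of `/proc` on Linux. This is a minimal, Linux-only illustration with hypothetical function names, not the procedure used during this outage.

```python
import os


def parse_state(stat_text):
    """Return (command, state) from the contents of a /proc/<pid>/stat file.

    The command sits in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    command = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self"
        stat_path = os.path.join(proc_root, name, "stat")
        try:
            with open(stat_path, encoding="utf-8", errors="replace") as f:
                command, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or unreadable entry
        if state == "D":
            blocked.append((int(name), command))
    return blocked
```

A handful of processes briefly in `D` is normal; a whole pile of web server workers stuck there at once points at whatever storage they are all waiting on.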

Dammit, EFS.

Update 2:

Yep. Moved that 💩 back into a docker volume and it’s back to its speedy self.

2017-11-19

[Status] Migrating to NFS

I’m migrating the WordPress files to NFS, to make sure they’re persistent and outlive the container, and to make tweaking the theme easier. Minor intermittent downtime expected.

EDIT: Done.

2017-11-19

Favicons!

I’ve added favicons. Many thanks to /u/Aurnyx for the outstanding base design, and Nick (on IRC) for vectoring and exporting them in various sizes! The icon reads surprisingly well even at 32x32, so I’ve exported everything at that size, except for 16x16, where the symbol reads best as the dot and circle.

Note that this may not show up in many browsers until their cache expires and it tries to load the icons again.

Here’s a small sampling of the icon at different sizes:

16x16: [image: 16x16 icon]

32x32: [image: 32x32 icon]

310x150: [image: 310x150 icon]

310x310: [image: 310x310 icon]

2017-11-18

[Status] Jetpack activated

Jetpack is now activated and you should be able to subscribe now through WordPress.com instead of spamming the server with RSS feed downloads from IFTTT. :P