FreelanceWizard Posted November 5, 2014
Some of you have, I'm sure, noticed the seemingly random error pages you've been getting on the forums over the last couple of weeks. This isn't a server performance issue, but rather a PHP incompatibility buried deep in our MyBB installation. I'm currently trying to debug the issue, which is challenging because it's seemingly random. PHP crashes in the same place each time, but I haven't been able to reproduce whatever is provoking the crash.
In the meantime, I'm tweaking our server to "hide" the issue by forcing it to retry requests that fail, since that generally causes the error to go away (remember, it's random). This tweaking is a bit delicate and might result in some weird errors (like the "out of memory" ones last night).
I'll update this thread with additional information as I get it. Meanwhile, if you can repeatedly produce an nginx error page, I'd love to know what you're doing to do it. Just drop me a PM with the details. Thanks as always for your patience!
FreelanceWizard Posted November 6, 2014
I believe I've isolated the issue and corrected it; we're no longer getting tons of segmentation faults. Without going into all the brutal details, it turns out the reputation system, of all things, does some weird stuff with shared memory, and I had to tweak some settings to handle it now that people's reputation numbers are getting into triple digits. I'm going to do a bit more tweaking on our FastCGI pools to undo the things I did for debugging, and I'll of course continue to monitor the situation. For now, though, I think we're back to normal.
Tiergan Posted November 6, 2014
Glad to hear that things are okay now! I was really starting to worry.
Unnamed Mercenary Posted November 6, 2014
Interesting. If it's not a security hole, could we get the brutal details? I'm fond of seeing the explanation for these things, even if the only language I can program effectively in is Java.
FreelanceWizard Posted November 6, 2014
Sure. Gory technical details ahoy:
Basically, the post reputation system tries to compute the reputations of everyone it sees on a page, and it does this not by loading a counter from the database, but by pulling in the whole database table for reputation entries and running a computation on that. Since I loosened up the reputation system, this table has grown, and it was exceeding the APC (the opcode/data cache) shared memory limit. PHP's response to this was to throw a fatal error and crash with a segmentation fault (because in the world of PHP, any fatal error is a segmentation fault). It was seemingly random because the reputation code is called in a lot of different places, some rather unexpected, and you could actually stuff the table into APC if not a lot of other things had already been put in there (like, say, user sessions). I was only able to work this out by tweaking the FastCGI process manager and nginx to try to capture those errors and the core dumps.
The solution was to increase PHP's maximum memory and the size of the shared memory space for APC. Merely turning off APC wasn't sufficient (since the board was grabbing the table more than once and blowing itself up), nor was just turning up the maximum script memory.
Now, in terms of how I'd do it if I were writing the code, I'd do one of two things, depending on how my database server behaved.
Keeping the reputation entries in a table is a fine idea, and if the DB server can do quick computations on query (as, say, MS SQL Server and Oracle can -- not that anything in Oracle is "quick" per se), I'd just have it do the aggregation and return a single number to me, something like:
select count(*) from users inner join postrep on users.user_id = postrep.user_id where postrep.user_id = @user group by postrep.user_id
I could then kick that over to the server using ExecuteScalar or the equivalent and quickly get an answer back. For getting multiple reputation counts at once, I might instead put postrep.user_id in the select statement and filter the results in the script. LINQ, for instance, makes that super-easy, depending on whether you want to use the Join, Intersect, or Where operators.
For something like MySQL, where aggregations at the database aren't fast, or a NoSQL/in-memory store where you really can't do any aggregations, I'd create a field in the users table to hold the current post reputation count for each user and increment it whenever someone got a new reputation bump. Then, I'd just pull the reputation counts out of the user table.
Of course, this being MyBB, I can't really do it how I'd want to and still be able to upgrade the code, so I'm stuck with watching logs and debugging when weird stuff happens.
FreelanceWizard Posted November 8, 2014
And to follow up...
The errors and extreme slowness we experienced this afternoon were due to a MySQL thing I've fixed. (As they say in the Lean world, every time you lower the water level, you end up hitting some new rocks.)
We use SSD storage for our RDS instance, which comes with an IOPS (I/O operations per second) limit. It can burst to obscene levels (3000 or so IOPS), but the actual normal IOPS depends on how much storage you have allocated. You get 3 IOPS per GB. Bursting is based on a limited set of IOPS credits, and when you run out, RDS locks you down to your purchased rate. You regain credits over time based on the amount of storage you allocate.
For reasons of my bank account, I try to keep the allocated storage on RDS to a relatively low level, since I pay $0.115 per GB allocated. I had 20 GB allocated, which is 60 IOPS. I based this on the server monitoring by Amazon. I did not, however, zoom in far enough to see how our average IOPS varies minute to minute (lazy Freelance, you should know better!).
At this point, any DBAs around can see where this is going... When I fixed the aforementioned PHP problem, our PHP throughput went up a lot. This meant our database IOPS usage went up a lot (more pages processed = more DB queries = more IOPS). We burned through our burst credits and, predictably, this ground MySQL to a halt, and ground the site to a halt.
The fix was for me to allocate more storage to handle our real IOPS, after doing some poking around in the CloudWatch monitoring to work it out. We now have enough storage to have the necessary IOPS so that the server won't melt down under load.
I also noticed one more error that might've been causing some grief to people, albeit rarely. Sometimes web browsers ask for weird things, and the response from PHP can have large headers. Our nginx header buffers were too small to support this, which might've caused error pages to appear. I've rejiggered the buffer allocation to use fewer, bigger buffers to fix this.
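The IOPS arithmetic in the post above can be worked through with a small sketch. The 3 IOPS per GB baseline and the roughly 3000 IOPS burst ceiling come from the post; the size of the credit bucket is an assumption made for illustration (the post only says the credits are limited), and the function names are hypothetical.

```python
import math

BASELINE_IOPS_PER_GB = 3   # from the post: 3 IOPS per allocated GB
BURST_IOPS = 3000          # approximate burst ceiling quoted in the post
# Assumption for illustration only: the post does not give the bucket size.
ASSUMED_BUCKET_CREDITS = 5_400_000


def baseline_iops(storage_gb):
    # Sustained rate you are held to once burst credits are gone.
    return storage_gb * BASELINE_IOPS_PER_GB


def seconds_until_throttled(storage_gb, demand_iops,
                            credits=ASSUMED_BUCKET_CREDITS):
    # Credits refill at the baseline rate and drain at the demand rate,
    # so the bucket only empties when demand exceeds the baseline.
    base = baseline_iops(storage_gb)
    demand = min(demand_iops, BURST_IOPS)
    if demand <= base:
        return None  # the bucket never runs dry
    return credits / (demand - base)


def storage_needed_gb(sustained_iops):
    # Smallest allocation whose baseline covers the sustained demand.
    return math.ceil(sustained_iops / BASELINE_IOPS_PER_GB)


print(baseline_iops(20))  # 60, matching the 20 GB allocation in the post
```

The design point is the one the post makes: sizing by average IOPS hides the minutes where demand sits above the baseline, and those minutes are what drain the bucket. Sizing storage with something like `storage_needed_gb` against the real sustained rate avoids depending on burst credits at all.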
Blue Posted December 6, 2014
I seem unable to access the wiki in any shape or form at the moment. Does that have anything to do with these errors?
FreelanceWizard Posted December 6, 2014
It's likely related to our switch to https. What browser and OS are you using?
Blue Posted December 7, 2014
FreelanceWizard wrote: "It's likely related to our switch to https. What browser and OS are you using?"
Opera 12.17, Windows 7.
I noticed that if I copy and paste the wiki address into Google Chrome, I can browse it, but if I try to copy a link to a picture in my wiki (in Chrome) and paste it into Opera, it won't display it.
FreelanceWizard Posted December 7, 2014
Yep, it's related to the TLS cipher issue I discussed on the suggestions thread. I'm currently collecting a list of ciphers to use and will be updating the settings in short order, now that I'm back in town and have the time to do that.
FreelanceWizard Posted December 7, 2014
Give it a go now and let me know if it works or not. The server should now fall back to a cipher suite that's compatible with Opera 12 and older versions of Android and IE while simultaneously offering super-high security on the latest versions of IE, Chrome, and Firefox.
Blue Posted December 11, 2014
Yup, it's doing great now, thanks!
Warren Castille Posted December 11, 2014
Not sure if it's just me, but I keep getting this to pop up randomly today when I try to click things.
MyBB has experienced an internal SQL error and cannot continue.
SQL Error: 2002 - php_network_getaddresses: getaddrinfo failed: Name or service not known
Query: [READ] Unable to connect to MySQL server
Perth Posted December 11, 2014
Just got that as well, but it seems to have gone away.
FreelanceWizard Posted December 11, 2014
Yeah, that's an AWS SQL error. I don't see anywhere where our RDS instance has fallen over, but random RDS network outages aren't unknown over in the AWS world.
Blue Posted December 12, 2014
I guess I should mention this now, though it has been going on for a long while (it just never really bothered me)... For a couple of months now I have no longer been receiving Alerts for being quoted in threads I did not make. I still get Alerts when:
- Receiving a PM.
- Someone posts on a thread I made.
- Someone quotes me in a thread I made.
- Someone modifies my reputation.
I just don't get alerts for being quoted in others' threads (such as the Screenshot Thread etc.). I've been trying to fix it through settings but I find nothing of the sort.
FreelanceWizard Posted December 12, 2014
I'll take a look at it.
Blue Posted December 12, 2014
It's not a terrible hindrance, so take your time ^^; I'd hate to make you do extra work when there are other things for you to do.
Perth Posted January 1, 2015
Not a forum error per se but DEFINITELY something that doesn't seem right... I just logged onto my account, then logged out so my boyfriend could log onto his. He had typoed his password at first and on his second attempt was suddenly barred entirely from logging in for half an hour. I hopped on my account to forward his message on his behalf, only to find that my account too was placed on a 30-minute wait, so I'm typing this on my phone. Is there any way that this peculiarly harsh restriction could be lessened in the future, please?
FreelanceWizard Posted January 1, 2015
Well, it definitely shouldn't be doing that on the second login attempt. We do have a relatively strict lock policy because of brute force attempts on people's accounts in the past. However, it's probably a smidge too aggressive at the moment, so I'll loosen it up a bit. I believe that the forum tracks lockouts on a per-IP basis, though, which would explain why it wouldn't let you in during the lockout period either.
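To illustrate why a per-IP lockout would also block a second account on the same connection, here is a minimal sketch of such a tracker. It is not MyBB's implementation: the class, its method names, and the failure threshold are hypothetical, and only the 30-minute lockout length comes from the thread.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoginThrottle:
    max_failures: int = 5            # hypothetical threshold
    lockout_seconds: int = 30 * 60   # 30-minute lockout, as in the thread
    _failures: dict = field(default_factory=dict)
    _locked_until: dict = field(default_factory=dict)

    def is_locked(self, ip, now=None):
        # Lockouts are keyed by IP address, not by account name.
        now = time.time() if now is None else now
        until = self._locked_until.get(ip)
        if until is None:
            return False
        if now >= until:
            # Lockout has expired: clear it and reset the failure count.
            del self._locked_until[ip]
            self._failures.pop(ip, None)
            return False
        return True

    def record_failure(self, ip, now=None):
        now = time.time() if now is None else now
        count = self._failures.get(ip, 0) + 1
        self._failures[ip] = count
        if count >= self.max_failures:
            self._locked_until[ip] = now + self.lockout_seconds

    def record_success(self, ip):
        # A successful login clears the failure count for that IP.
        self._failures.pop(ip, None)
```

Because the key is the address, anyone else logging in from the same household or network hits the same lockout until it expires, whichever account they use. Keying on the account name as well would avoid that, at the cost of making it easier for an attacker to spread guesses across many accounts.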
Perth Posted January 1, 2015
FreelanceWizard wrote: "Well, it definitely shouldn't be doing that on the second login attempt. We do have a relatively strict lock policy because of brute force attempts on people's accounts in the past. However, it's probably a smidge too aggressive at the moment, so I'll loosen it up a bit. I believe that the forum tracks lockouts on a per-IP basis, though, which would explain why it wouldn't let you in during the lockout period either."
Oh, yeah, I can certainly understand the need to have the lockout in that case. I guess it never crossed my mind that something that serious could/would happen here; I'm too optimistic. Turns out he did log on once prior, but successfully after one attempt. Thank you very much for being so responsive and looking into this! (Edited because texting on my phone without typos is like trying to shoot ants)