Monday, 23 January 2012
Where are all the pretty pictures?
You may have noticed this morning (or may still be noticing) that there are no vehicle photos on Lemonfree! Luckily this is only temporary. We store our images in Amazon S3, which was having DNS resolution issues early this morning. Amazon says the issue is resolved, so now it's just a matter of waiting for the DNS changes to propagate throughout the internet. Photos will return soon!
Wednesday, 30 November 2011
Yesterday's outage (and sudden MySQL upgrade)
Early yesterday afternoon we unfortunately had an unplanned outage lasting roughly 30-45 minutes. The database instance became unresponsive, and we were unable to immediately restart MySQL or completely reboot the instance. We decided to get a new instance up and running rather than hope that the broken server fixed itself. Since we had already been planning a MySQL upgrade, we went ahead and did it immediately on the new instance. The outage was roughly 1/3 troubleshooting and 2/3 restoring data from backups.
Poking around once everything was stable revealed a few corrupted tables on the old instance, which were likely responsible for MySQL not restarting. We're still not sure what caused that corruption.
Lemonfree.com now runs MySQL 5.1.52. We had planned this upgrade for the additional useful features in 5.1, and because 5.0.x is now an EOL version.
Thursday, 18 August 2011
Keeping track of detailed performance metrics
As a big proponent of the "track everything possible" philosophy, I like having numbers and pretty graphs for anything I can find. When it comes to websites (or any application) there are many different types of metrics you can track: page speed, database response times, lead generation, etc. To pin down issues we also need to log events, such as code pushes or our extensive partner data imports. It probably goes without saying that any solution we use for tracking has to be as simple as possible to implement, and have minimal to no effect on the website's performance. That sounds like a lot of requirements, so what to use?
Originally I had tried out graphite, since rrdtool was a little more complicated to implement and we didn't require most of the features it provided. This worked great at first; a single line to track something and graphs available pretty much instantly. We soon realised that while this gave us an excellent overview of what was going on, we actually needed a few extra features that neither solution provided:
1. Additional detail for a metric
Having the measurement and time is definitely the most important part, but there are situations where analyzing some metrics would benefit from having additional data attached to them. For example we track execution and response time for most of our database queries. Queries like the one we use for search, however, are constantly varying depending on what parameters are used in the search. So when we look at a graph and see that hey, the search query had a huge spike last night, it would be very useful to see the exact query string(s) involved. Neither graphite nor rrdtool can attach arbitrary text to a metric measurement, since that's not really their purpose. We could probably log the queries and timestamps separately and then try and correlate them afterwards, but that comes with additional overhead and complexity.
2. More graphing features
Since they are both meant for semi-consistent time series data, the graphing features in graphite and rrdtool are mostly limited to line plots. For some of the things we measure it would be nice, for display purposes, to have other types of graphs, such as pie charts (for lead success/fail percentages). A few other features like multiple Y axes and data point labels would also be handy. I could add all that to the graphing provided by the tools, but since there are other non-graph features I'd like to have, it makes more sense to go custom.
So with all that, it looks like we'll be putting together something from scratch. Well, not completely from scratch, as we will borrow some concepts from existing tools. But our requirements are so different from the intended purpose of those tools that it seems easier (especially from a future maintenance POV) to write something ourselves than to customize an existing project that wasn't meant to work that way.
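To make the "additional detail" requirement concrete, here is a minimal sketch of what a measurement with attached text could look like. This is not the tool we ended up building; SQLite, the `metrics` table layout, and the names `open_store`, `record`, `timed`, and `spikes` are all assumptions chosen for illustration.

```python
import sqlite3
import time
from contextlib import contextmanager

# Hypothetical schema: one row per measurement, plus optional free text
# (for example the exact query string) attached to that measurement.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    name   TEXT NOT NULL,
    ts     REAL NOT NULL,
    value  REAL NOT NULL,
    detail TEXT
)
"""


def open_store(path=":memory:"):
    """Open (or create) the metric store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, name, value, detail=None, ts=None):
    """Store one measurement; ts defaults to the current time."""
    conn.execute(
        "INSERT INTO metrics (name, ts, value, detail) VALUES (?, ?, ?, ?)",
        (name, time.time() if ts is None else ts, value, detail),
    )
    conn.commit()


@contextmanager
def timed(conn, name, detail=None):
    """Time the wrapped block and record the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record(conn, name, time.perf_counter() - start, detail)


def spikes(conn, name, threshold):
    """Return (ts, value, detail) rows above threshold, slowest first."""
    rows = conn.execute(
        "SELECT ts, value, detail FROM metrics "
        "WHERE name = ? AND value > ? ORDER BY value DESC",
        (name, threshold),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()
    # The metric name and query text below are made up for the example.
    with timed(conn, "db.search.seconds", detail="SELECT ... WHERE make = 'Honda'"):
        time.sleep(0.01)  # stand-in for running the real query
    for ts, value, detail in spikes(conn, "db.search.seconds", 0.005):
        print(ts, value, detail)
```

With something like this, a spike on the search graph can be traced straight back to the query strings that caused it. Note that committing on every insert is synchronous, so a real version would probably need to buffer or send measurements out of band to keep the "minimal to no effect on performance" requirement.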
Labels: metrics, performance
Tuesday, 9 August 2011
Last night's outage
Lemonfree.com was unavailable last night from around 9:30pm – 10:00pm CST. This outage was caused by a network issue affecting the Amazon zone where our servers are located (Lemonfree.com runs on Amazon EC2 instances). Happily Amazon got the jump on it and had everything working normally by 10.
What can we do to keep this particular problem from affecting Lemonfree.com again? A good start would probably be an automatic failover in a different AWS zone! This should handle most Amazon-related issues, as they generally are confined to a single zone at most.
Sorry to anyone who was unable to access Lemonfree.com during this time!