Last week we moved some of the significant user-profile-related tables to a separate database (blog post coming soon), and when we deployed some code preparing for the task, we noticed a few milliseconds had crept into our ~5ms median response time.
After shedding some tears, we rolled back the code to win back the lost performance and started investigating the cause of the response-time degradation. Reverting the last pull request didn't resolve the issue, so we thought that maybe there were other pull requests in the master branch which had been deployed to other apps, but not to the app we had just deployed.
We still keep multiple apps in one git repository while we separate the monolith: it makes the separation easier without blocking other development, like new features or bugfixes. This means someone can merge a pull request into the repository and deploy only the related app (which is an sbt subproject), so the same code reaches the other apps later, when a new pull request related to those apps is deployed. As a result, reverting the last pull request and redeploying could still bring previously undeployed code into production: e.g. if the last deployed pull request was #198 and the current one is #200, reverting and redeploying brings #199 into production (or even more, if the last deploy for the app in question was an earlier one, like #190).
We shed another tear of sorrow and opened up New Relic. New Relic is a pretty awesome tool for monitoring and performance-debugging your application (yeah, a 20% discount on this month's fee will be fine for the advertising, thanks).
Normally you can see a basic "Breakdown table" for transactions (which are basically controller actions), but it's not that informative:
Okay, move along, nothing to see here. You can mark a transaction as a "Key transaction", and you'll get some more insight. New Relic saves a transaction trace every few minutes, so you can look at concrete SQL queries with execution times, and the same breakdown table, but with the actual times for that transaction instead of averages. After making sure that there were no slow or unnecessary queries, we reached for the holy grail: X-Ray.
New Relic's X-Ray Sessions feature allows you to gain deeper insights into a key transaction's performance by showing transaction traces alongside long-running profiler results. After you start an x-ray session, New Relic will collect up to 100 transaction traces and a thread profile for your transaction. Collection automatically stops at 100 traces or 24 hours, whichever comes first.
You can enable it from the browser on the fly, it doesn't need any code changes, and in a few minutes you'll have really deep insight into what is going on.
As you can see at the bottom, at some point a ConfigException is raised, which took 6.7% of the request CPU time. Digging into it, we found that we used body parsers without passing the maxLength parameter, so Play tried to retrieve it from the configuration and, when it was unable to find it, simply used 100k for that value.
The problem is that play.api.Configuration.getBytes calls play.api.Configuration.readValue, which has a try-catch block inside, so an exception is raised, stack trace and all, just to give you a None at the end of the day.
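To see why this costs CPU on every request, here is a minimal sketch of the pattern (simplified pseudo-version, not Play's actual source): a missing key is reported by constructing and throwing an exception, which the caller immediately swallows to produce a None. Filling in the stack trace is the expensive part, and it happens on every lookup miss. The names ConfigSketch, readValueSlow, and readValueFast are ours, for illustration only:

```scala
object ConfigSketch {
  private val config = Map("parsers.text.maxLength" -> "100k")

  // Mimics the costly path: the miss is signalled with a freshly
  // constructed exception (stack trace included), caught right away,
  // and turned into a None.
  def readValueSlow(key: String): Option[String] =
    try {
      Some(config.getOrElse(
        key,
        throw new NoSuchElementException(
          s"No configuration setting found for key '$key'")))
    } catch {
      case _: NoSuchElementException => None
    }

  // Behaviourally identical, but the miss is a plain None with no
  // exception machinery involved.
  def readValueFast(key: String): Option[String] =
    config.get(key)
}
```

Both functions return the same result; the difference is only in how much work a miss does, which is exactly why adding the key to the config (so the lookup always hits) made the exception disappear from the profile.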
Adding parsers.text.maxLength to the application config gave us the exact same application behaviour, but saved us some performance by eliminating a raised exception on every request.
Another config value we found this way is trustxforwarded: you should set it to false if you don't need it, or if you just don't know whether you need it.
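Putting both settings together, the additions to application.conf would look something like this (a sketch based on the config keys named above; the 100k value matches Play's built-in default, so behaviour stays the same while the per-request exception goes away):

```
# Explicit max body length for the text body parser.
# 100k is Play's default, so this only short-circuits the failed lookup.
parsers.text.maxLength=100k

# Disable X-Forwarded-* handling if you are not behind a trusted proxy.
trustxforwarded=false
```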
So at the end of the day, the actual action call takes 4.2% of the CPU time instead of the previous 28%, and we did it without degrading the performance of any other part of the picture. :)
If you have any idea what is going on in the org.jboss.netty.channel.socket.nio.AbstractNioSelector.select branch (the controller got called in the .process branch), and whether anything can be improved there, feel free to tell us!