RVM-687

Cattrack performance: regression report unusable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Bug
    • Affects Version/s: None
    • Fix Version/s: hg tip
    • Labels: None
    • Number of attachments: 1

      Description

      For the last several weeks, cattrack pretty much always times out or fails when I try to get it to show me a regression report.

      The server as a whole isn't down (one can follow a failure link and navigate around program output, etc.), but asking for a regression or performance report results in the request hanging until it times out.

      I suspect some performance pathology in the queries used to construct the regression report (looking for the last 10 versions of the run??), but I don't know enough about how the system works to figure out what the problem is.


          Activity

          David Grove added a comment -

          Also, once you ask for a regression report, the ability to navigate to "simpler" pages via direct links goes away. So it seems like asking for the regression report is sufficient to hang the server.

          Ian Rogers added a comment -

          I'm not familiar with the source to cattrack. From digging around I came across:

          http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/cattrack/trunk/app/models/report/performance_report_statistic_renderer.rb?revision=13645&view=markup

          Line 91 defines how the results are obtained: we attempt to draw a pixel for every result we've ever had, even if that's more pixels than are visible in the image.

          The performance logs on cattrack showed something like 90% of execution time going to the query and 10% to rendering the image. Because of this, I considered writing a rather simpler "cattrack light" on Google's App Engine, using their chart APIs.
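The fix Ian's observation implies can be sketched as follows: bucket the result series down to at most one value per horizontal pixel before drawing, instead of plotting every result ever recorded. (This is an illustrative sketch; the function name and data shape are assumptions, not cattrack's actual rendering code.)

```ruby
# Reduce a series of result values to at most `width` points (one per
# horizontal pixel) by averaging each bucket. A minimal sketch only --
# the real renderer works with database rows, not a plain array.
def downsample(values, width)
  return values if values.size <= width
  step = values.size.to_f / width
  (0...width).map do |i|
    bucket = values[(i * step).floor...((i + 1) * step).ceil]
    bucket.sum / bucket.size.to_f
  end
end
```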

          Peter Donald added a comment -

          There were several performance problems that I discovered last time I looked at this. For one, we do a massive join over large amounts of data every time we draw a graph, and some of the joins require a table scan, which takes longer and longer as more data gets added. Also, the number of results returned just keeps growing over time. One simple solution is to limit the results to a maximum time period (I would suggest 3 months), so we just need to add another criterion to the where clause. Another solution, which stops one runaway process locking the whole system, is to serve the app over a cluster of mongrel instances (see RVM-307), as currently the app is single threaded.
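The time-window idea Peter suggests could look roughly like this. The `occurred_at` column name is an assumption for illustration; cattrack's actual schema may use a different timestamp column.

```ruby
require 'date'

MAX_WINDOW_DAYS = 90  # roughly the 3 months Peter suggests

# Append a trailing-window criterion to an existing SQL where clause so
# queries only touch recent runs. `test_runs.occurred_at` is hypothetical.
def windowed_conditions(base_conditions, today = Date.today)
  cutoff = today - MAX_WINDOW_DAYS
  "#{base_conditions} AND test_runs.occurred_at >= '#{cutoff}'"
end
```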

          Daniel Frampton added a comment -

          The number of rows actually returned is only on the order of 1500 at this stage. Running the results query directly against the database gives somewhat slow but much more reasonable performance than cattrack is showing (a second or so for each result). The logs show something like 30-70% for the DB portion of execution, but I think there may be some overhead in how Ruby accesses the database that is contributing to this, as it seems higher than I would expect.

          The loops that iterate over the data and draw the image are also completely unoptimized, and I imagine add a significant amount of overhead.

          I prototyped adding conditions to limit changes to 3 months and performance was still very poor, as well as going against part of the idea of having these graphs – which is to give us a more complete historical view of performance.

          We really need to perform some level of aggregation for the older values in a format ready for graphing, but I don't have the time to work on this at the moment.

          Peter has identified one of the fundamental problems with the infrastructure at the moment: we have a single-threaded app (which is also very slow – interpreter only), so if anything goes out to lunch we can't even serve basic pages... The regression report requires dozens of requests (usually several at a time).

          Once I finish writing (quite a while) I will look at what we can do – including the possibility of changing the implementation to something that more active members of the project (> 0) have some experience with.

          David Grove added a comment -

          I'm starting to learn Ruby & Rails, so this may be a good practical learning exercise for me.

          Assuming that we can make it work, I'd like to leave it as a Ruby app, since that seems to be the "right" modern technology for building these kinds of applications.

          David Grove added a comment -

          Some random thoughts.

          (1) Queries with multiple hosts will rarely (never?) make sense. Would we gain anything by breaking things up so that we had each host being served by its own cattrack instance?

          (2) The purge that Daniel did to chop the first 5000 test runs has gotten performance back to quite manageable levels for me. This seems to imply that the queries used in regression_report.rb (which is the primary one that was unusable before the purge) were doing something that was O(# sanity runs). At least conceptually, it seems like this could be avoided, because we should only need to look at the last 10 test runs for that host/variant combination to get all the data we need. Looking at the logs, it appears that pretty much all the time is going into database access:

          Processing TestRunController#regression_report (for 129.34.20.19 at 2009-01-15 07:48:22) [GET]
          Parameters:

          {"host_name"=>"jalapeno.anu.edu.au", "action"=>"regression_report", "controller"=>"results/test_run", "test_run_variant"=>"sanity", "test_run_id"=>"7257"}

          Rendering within layouts/application
          Rendering results/test_run/regression_report
          Completed in 4.82860 (0 reqs/sec) | Rendering: 0.15054 (3%) | DB: 4.66213 (96%) | 200 OK http://jikesrvm.anu.edu.au/results/jalapeno.anu.edu.au/sanity/7257/regression_report

          (3) Should we think about attempting an upgrade to Rails 2.2, the latest mongrel, postgres bindings, etc.?
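If the suspicion in (2) is right, the fix amounts to selecting the last 10 run ids first and restricting the heavy joins to those ids. A pure-Ruby sketch of the filter (the function and column names are assumptions, not cattrack's actual query code):

```ruby
RECENT_RUNS = 10

# Given all run ids for a host/variant combination, build an IN-list filter
# covering only the most recent runs, so the big joins stop scaling with
# the total number of sanity runs ever recorded.
def recent_run_filter(run_ids)
  recent = run_ids.sort.last(RECENT_RUNS)
  "test_run_id IN (#{recent.join(', ')})"
end
```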

          Peter Donald added a comment -

          Re: (1) I suspect this will make no significant performance impact, and given the effort required to achieve it, I wouldn't bother. If there is a significant performance impact, it is likely because of missing indexes.

          Re: (2) Try putting "config.log_level = :debug" into environment.rb and restarting the server, then hit the regression report again. This should give you timings for each individual query, which should help isolate the problematic one. Then it should be easy to either index the columns causing the issues or rework the query to be less of a monster. (Afterwards, comment the "config.log_level ..." statement out and restart the server again, as it produces heaps of debug output and you don't want to fill up the drive with that.)

          Re: (3) I wouldn't bother unless you plan to do significant new development. There is no huge performance gain to be had there, and the number of little things that have changed would require a bit of effort to get going again.
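For reference, the change Peter suggests in (2) would sit in `config/environment.rb` in the classic pre-Rails-3 configuration style, roughly as below (a config fragment, not standalone-runnable code):

```ruby
# config/environment.rb -- temporarily log every SQL statement with its
# duration, then comment this back out so the log doesn't fill the drive.
Rails::Initializer.run do |config|
  config.log_level = :debug
end
```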

          David Grove added a comment -

          Thanks for the quick response Peter.

          I attached a log from a regression_report. I could be reading it wrong, but it looks like the problem queries are the ones one might suspect: the computation of new/intermittent/consistent failures by doing complex joins across the last 10 runs on all the test cases. Is that right? If so, is there anything to be done? They are taking about 0.6 seconds each, and the entire page view takes about 5 seconds.

          Daniel Frampton added a comment -

          There are a bunch of different things we will ultimately need to do:

          1) We can try and improve individual query results by changing the queries and database indexes.
          2) We can pre-calculate and cache results (and/or images) for the slower queries into the tables when adding result runs.
          3) We can make the system gracefully handle a slow query (such that this doesn't kill the entire system).

          I think issue (3) is the real priority at the moment; then I would look at (1) and (2). While there may be some low-hanging fruit in (1), as Peter suggests, I think the fundamental problem we have at the moment is how the system deals with slow queries and peak demand: a slow page should time out, not bring down the entire system. Also, it appears that our current system is single threaded; a slow result prevents even a fast cached page from being returned.
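Option (2) above might look something like this at import time: fold raw results into per-week averages so the graphing queries read a small pre-computed table instead of re-joining the raw rows. (A sketch under assumed data shapes; the `[date, value]` pair format is an illustration, not cattrack's real schema.)

```ruby
require 'date'

# Aggregate [date, value] result pairs into per-ISO-week averages, keyed by
# [ISO year, ISO week]. Intended to run once per import, with the output
# stored in a small summary table that the graphs query directly.
def weekly_averages(results)
  results.group_by { |date, _| [date.cwyear, date.cweek] }
         .map { |week, rows| [week, rows.sum { |_, v| v } / rows.size.to_f] }
         .sort
end
```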

          Peter Donald added a comment -

          I just had a look and can't see any easy way to tackle (1). It looks like most of the slow queries now generate a huge temp table (i.e., ~22,000 largish rows) that exceeds the shared memory cache. It may be possible to tune postgres, but I don't know how off the top of my head.

          (3) can be tackled by RVM-307 ... should be easy to do but I am reluctant to tackle as I can't monitor the situation.

          (2) is probably the way forward. At one stage I had a version that did produce a nice, easily queryable OLAP-like table (see tags/pre-olap-deletion), so this could potentially be revived for performing queries against.

          Images/html generated are already cached (see the public/results/* directories), but each page needs to be hit once. It looks like someone (maybe me?) tried to do this via wget after an import but disabled it because of some bad syntax. I fixed the syntax errors and re-enabled it, so after an import the performance report may be regenerated and thus served as a static page when a human requests it. (To disable it, in case I stuffed something up, change perform_wget=true to perform_wget=false in app/services/test_run_importer.rb.)
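The cache-warming flow Peter describes amounts to requesting each report URL once after an import so the rendered page lands under public/ and later requests are served statically. A sketch of building such a URL, with the path shape taken from the log in David's comment above (the host, variant, and run id arguments are illustrative):

```ruby
# Build the regression-report path that a post-import wget would hit.
# Path shape matches the logged request:
#   /results/<host>/<variant>/<run_id>/regression_report
def regression_report_path(host, variant, run_id)
  "/results/#{host}/#{variant}/#{run_id}/regression_report"
end
```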

          David Grove added a comment -

          This particular problem has been "solved" by reducing the size of the database.

          Will handle misc improvements in other issues.


            People

            • Assignee:
              David Grove
            • Reporter:
              David Grove
            • Votes:
              0
            • Watchers:
              0
