Continuum / CONTINUUM-2371

Unable to view queues after loss of build agent

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.4 (Beta)
    • Fix Version/s: 1.4.0 (Beta)
    • Component/s: Distributed Builds
    • Labels: None
    • Complexity: Intermediate
    • Number of attachments: 0

      Description

      When a build agent goes down unexpectedly, the system does not handle it gracefully.

      The Queues page (/displayQueues!display.action) shows "Error Occurred: org.apache.maven.continuum.ContinuumException: Error retrieving projects currently preparing build in http://localhost:8181/continuum-buildagent/xmlrpc".

      To reproduce, add 2 or more agents, force a bunch of builds, then stop one of the agents without disabling it from the master.

      I notice that the Build Agents page (/buildAgentList.action) still shows the agent as Enabled=true, even though it is not responding.

      There should probably be some concept of the system automatically disabling an agent that is not behaving correctly. For now I'd be happy with the admin having to manually re-enable it, but an auto-recovery feature might be a nice enhancement for later.


          Activity

          Wendy Smoak added a comment -

          The second time I tried this, I was able to edit and disable the build agent that had been lost; however, the Queues page still showed the error.

          Removing the bad agent from the build agent group also did not help.

          If an agent is disabled, I don't think it should be contacted to see what is in its queue.

          If it is going to be contacted, then the system needs to handle the case where it is not reachable.
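
          The two behaviours asked for above (never contact a disabled agent; tolerate an enabled agent that cannot be reached) can be illustrated with a minimal Java sketch. It is not Continuum's code: AgentConfig, QueueClient, projectsPreparingBuild, QueueSnapshot and QueueView are hypothetical names, and a real client would make the XML-RPC call to the agent URL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a configured build agent: its XML-RPC URL and the Enabled flag.
class AgentConfig {
    final String url;
    final boolean enabled;

    AgentConfig(String url, boolean enabled) {
        this.url = url;
        this.enabled = enabled;
    }
}

// Hypothetical remote call; a real client would go over XML-RPC and may throw when the agent is down.
interface QueueClient {
    List<String> projectsPreparingBuild(String agentUrl) throws Exception;
}

// What the Queues page would render: per-agent queues plus per-agent error messages.
class QueueSnapshot {
    final Map<String, List<String>> queues = new LinkedHashMap<>();
    final List<String> errors = new ArrayList<>();
}

class QueueView {
    // Disabled agents are never contacted; a failure on one agent is recorded
    // against that agent instead of aborting the whole page.
    static QueueSnapshot collect(List<AgentConfig> agents, QueueClient client) {
        QueueSnapshot snapshot = new QueueSnapshot();
        for (AgentConfig agent : agents) {
            if (!agent.enabled) {
                continue;
            }
            try {
                snapshot.queues.put(agent.url, client.projectsPreparingBuild(agent.url));
            } catch (Exception e) {
                snapshot.errors.add(agent.url + ": " + e.getMessage());
            }
        }
        return snapshot;
    }
}
```

          With this shape, a dead agent would show up as one error entry while the queues of the remaining agents still display, instead of a single exception replacing the whole page.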

          Wendy Smoak added a comment -

          Even after deleting the dead build agent, the Queues page still shows an error.

          Apparently the system is still trying to get information from the agent, even after it has been deleted.

          Possibly related: I noticed that the continuum.xml config file still contains the deleted agent.

          A restart brings continuum.xml back in sync and makes the Queues page work again.

          It should not be necessary to restart in this situation.

          Wendy Smoak added a comment - Related thread: http://www.nabble.com/What-should-happen-when-a-distributed-agent-dies--td25649131.html
          Wendy Smoak added a comment -

          The problem of edits to or deletion of an agent appearing to have no effect should have been fixed by CONTINUUM-2377. The continuum.xml file should now stay in sync without a restart.

          We still need to deal with the unexpected loss of an agent that is still configured as 'enabled'.

          Maria Catherine Tan added a comment -

          Fixed in r926483:

          1. Check whether the agent is available by calling its ping method before calling any other methods on it.
          2. If the agent is not available, disable it.
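
          The two steps of the fix can be sketched in Java as follows. This is a minimal illustration under assumptions, not the code committed in r926483: BuildAgent, AgentTransport, ping, queuedProjects and AgentGuard are hypothetical names, and a real implementation would also persist the disabled flag to the configuration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical agent record; Continuum's real configuration class may differ.
class BuildAgent {
    final String url;
    boolean enabled = true;

    BuildAgent(String url) {
        this.url = url;
    }
}

// Hypothetical transport to the agent. ping() may return false or throw if the agent is unreachable.
interface AgentTransport {
    boolean ping(String agentUrl) throws Exception;

    List<String> queuedProjects(String agentUrl) throws Exception;
}

class AgentGuard {
    // Step 1: ping the agent first. A false answer or any exception means "not available".
    static boolean isAvailable(BuildAgent agent, AgentTransport transport) {
        try {
            return transport.ping(agent.url);
        } catch (Exception e) {
            return false;
        }
    }

    // Step 2: disable an unavailable agent instead of calling it; otherwise make the real call.
    static List<String> queuedProjectsOrDisable(BuildAgent agent, AgentTransport transport) {
        if (!agent.enabled) {
            return Collections.emptyList(); // disabled agents are not contacted
        }
        if (!isAvailable(agent, transport)) {
            agent.enabled = false; // in-memory only; a real fix would also save the configuration
            return Collections.emptyList();
        }
        try {
            return transport.queuedProjects(agent.url);
        } catch (Exception e) {
            // The agent answered the ping but failed afterwards; surface the failure to the caller.
            throw new IllegalStateException("Call to " + agent.url + " failed after a successful ping", e);
        }
    }
}
```

          A successful ping does not guarantee that the next call succeeds (the agent can die in between), so the real call still needs its own error handling; the sketch reports that case to the caller rather than silently disabling the agent.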

          Maria Catherine Tan added a comment -

          Added a unit test in r926826.
          Updated the documentation in r926839.


            People

            • Assignee: Maria Catherine Tan
            • Reporter: Wendy Smoak
            • Votes: 0
            • Watchers: 0
