BTM
  1. BTM
  2. BTM-64

If an exception in BTM.begin() occurs, the transaction context is incorrect

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.3
    • Fix Version/s: 2.0.0
    • Labels:
      None
    • Environment:
      BTM SVN Snapshot from 2009-03-22
    • Number of attachments :
      1

      Description

      (We are still running the BTM snapshot version from 2009-03-22, but judging from the code the bug is still present in the newest version)

      I am analyzing a problem that occured on a few of our servers, where we got many "transaction timed out" exceptions until the server was restarted.

      It looks like all of these exceptions occur in the same thread, after an OOM occured:

      Caused by: java.lang.OutOfMemoryError: PermGen space
      at bitronix.tm.utils.ManagementRegistrar.buildObjectName(ManagementRegistrar.java:87)
      at bitronix.tm.utils.ManagementRegistrar.register(ManagementRegistrar.java:57)
      at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:278)
      at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:261)
      at bitronix.tm.BitronixTransaction.setActive(BitronixTransaction.java:256)
      at bitronix.tm.BitronixTransactionManager.begin(BitronixTransactionManager.java:87)
      at XXX.DelegatingTransactionManagerInterceptor.begin(DelegatingTransactionManagerInterceptor.java:131)
      at org.springframework.transaction.jta.JtaTransactionManager.doJtaBegin(JtaTransactionManager.java:886)
      at org.springframework.transaction.jta.JtaTransactionManager.doBegin(JtaTransactionManager.java:843)

      If my analysis is correct, an exception in BTM.begin() after the call to createTransaction() will leave the failed transaction associated to the ThreadContext. This means that after a while, every call to TransactionContextHelper.enlistInCurrentTransaction from this thread will fail, because the old transaction (that never got used due to the exception in begin(), but seems to still be marked by the TransactionTimeoutTask) will be used. Since it's marked as timed out, the call will fail.

      The stacktrace then looks like this:
      Caused by: java.sql.SQLException: error enlisting a JdbcConnectionHandle of a JdbcPooledConnection from datasource user2 in state ACCESSIBLE wrapping org.postgresql.xa.PGXAConnection@74187f3e on Pooled connection wrapping physical connection org.postgresql.jdbc4.Jdbc4Connection@7449fafd
      at bitronix.tm.resource.jdbc.JdbcConnectionHandle.enlistResource(JdbcConnectionHandle.java:61)
      at bitronix.tm.resource.jdbc.JdbcConnectionHandle.prepareStatement(JdbcConnectionHandle.java:200)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$TransactionAwareInvocationHandler.invoke(TransactionAwareD
      ataSourceProxy.java:225)
      at $Proxy10.prepareStatement(Unknown Source)
      at com.ibatis.sqlmap.engine.execution.SqlExecutor.prepareStatement(SqlExecutor.java:494)
      at com.ibatis.sqlmap.engine.execution.SqlExecutor.executeQuery(SqlExecutor.java:176)
      at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.sqlExecuteQuery(GeneralStatement.java:205)
      at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.executeQueryWithCallback(GeneralStatement.java:173)
      ... 45 more
      Caused by: bitronix.tm.internal.BitronixSystemException: transaction timed out
      at bitronix.tm.resource.common.TransactionContextHelper.enlistInCurrentTransaction(TransactionContextHelper.java:44)
      at bitronix.tm.resource.jdbc.JdbcConnectionHandle.enlistResource(JdbcConnectionHandle.java:59)
      ... 55 more

        Activity

        Hide
        Brett Wooldridge added a comment -

        I'll weigh in on this issue. I do not consider this a bug in BTM. OutOfMemory is a non-recoverable Error.

        As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications.

        http://eyalsch.wordpress.com/2009/06/17/oome/

        OOM is an Error not an Exception, and again is not considered recoverable. It is a sign of a misconfigured, or improperly designed application. A properly configured (and designed) server, with no leaks, can run until the Sun (or processor) burns out. An OOM error can occur anywhere at anytime under low memory conditions. An attempt to catch or handle OOM can itself result in an OOM error. Even something as simple as trying to log the exception can itself throw another OOM Error.

        An internet search for OutOfMemory and recoverable (or non-recoverable) will backup what I'm saying. Sun's official stance is that OOM is non-recoverable, that's why from Day 1 it has been a sub-class of VirtualMachineError – not Exception.

        Sun has added an option to let you take some action, like restart your server, when an OOM occurs:

        -XX:OnOutOfMemoryError="<cmd args>;<cmd args>"

        See http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp

        Suggested close.

        Show
        Brett Wooldridge added a comment - I'll weigh in on this issue. I do not consider this a bug in BTM. OutOfMemory is a non-recoverable Error. As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications. http://eyalsch.wordpress.com/2009/06/17/oome/ OOM is an Error not an Exception, and again is not considered recoverable. It is a sign of a misconfigured, or improperly designed application. A properly configured (and designed) server, with no leaks, can run until the Sun (or processor) burns out. An OOM error can occur anywhere at anytime under low memory conditions. An attempt to catch or handle OOM can itself result in an OOM error. Even something as simple as trying to log the exception can itself throw another OOM Error. An internet search for OutOfMemory and recoverable (or non-recoverable) will backup what I'm saying. Sun's official stance is that OOM is non-recoverable, that's why from Day 1 it has been a sub-class of VirtualMachineError – not Exception. Sun has added an option to let you take some action, like restart your server, when an OOM occurs: -XX:OnOutOfMemoryError="<cmd args>;<cmd args>" See http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp Suggested close.
        Hide
        Ludovic Orban added a comment -

        I'm not sure this problem can be considered a bug in BTM and even in the case it can what it could do to prevent the problem.

        This problem can only happen in extreme cases, like for instance when a OOME is thrown in the middle of the begin() call. What should BTM do? Catch Throwable, try to clean up then re-throw the throwable as a SystemException?

        OTOH, one could argue and blame this on your code: you provoked the OOME so you should catch it and try to clean up the transactional context.

        I'm no trying to tell you this problem is only yours and I won't do anything about it but I'm rather challenging you to explain me what a middleware is supposed to do and can do in such situation.

        Show
        Ludovic Orban added a comment - I'm not sure this problem can be considered a bug in BTM and even in the case it can what it could do to prevent the problem. This problem can only happen in extreme cases, like for instance when a OOME is thrown in the middle of the begin() call. What should BTM do? Catch Throwable, try to clean up then re-throw the throwable as a SystemException? OTOH, one could argue and blame this on your code: you provoked the OOME so you should catch it and try to clean up the transactional context. I'm no trying to tell you this problem is only yours and I won't do anything about it but I'm rather challenging you to explain me what a middleware is supposed to do and can do in such situation.
        Hide
        Dennis Brakhane added a comment -

        I probably should have made myself more clear.

        I completely agree that the real issue is the OOM, and if BTM wouldn't have get confused, probably some other part of the application would have (although in that particular case, the JVM seems to have recovered)

        I just wanted to suggest that BTM could disassociate the recently opened TX when something unexpected occured during initialization.

        It's certainly a minor issue or even a wishlist item.

        Show
        Dennis Brakhane added a comment - I probably should have made myself more clear. I completely agree that the real issue is the OOM, and if BTM wouldn't have get confused, probably some other part of the application would have (although in that particular case, the JVM seems to have recovered) I just wanted to suggest that BTM could disassociate the recently opened TX when something unexpected occured during initialization. It's certainly a minor issue or even a wishlist item.
        Hide
        Ludovic Orban added a comment -

        Sure, but how?

        Show
        Ludovic Orban added a comment - Sure, but how?
        Hide
        Brett Wooldridge added a comment -

        There is no way to adequately address this issue. The OOM could have occured any number of placed in BTM and caused similar havoc. If we silently disassociated the transaction from the thread, the app would fail. If we tried to disassociate the transactiion we could encounter another OOM. There are several thousand of lines of code in BTM, and any given line, creating a string, comparing two objects, calling a method, could be hit with an OOM error and the effects on BTM are neither preventable nor recoverable.

        BTM was the unlucky victim in the case, but if it were the JDBC driver, log4j, or any other component, somebody else would be looking at a bug, not BTM. I don't mean to be rude or short, but it's like shooting the messenger. BTM was the unfortunate victim of a VM configured with too few resources. Even if we put a try catch in this piece of code, 1) it still could fail with another OOM, and 2) the odds of an OOM hitting that small section of code again instead of any of the thousands of others is almost zero.

        Here's an experiment. Configure your VM with 64M of memory, and see how well your application, or any other component in the system deals with the OOM that will result.

        It is not recoverable, or preventable, and trying to "fail gracefully" is just passing the OOM problem down the line.

        Show
        Brett Wooldridge added a comment - There is no way to adequately address this issue. The OOM could have occured any number of placed in BTM and caused similar havoc. If we silently disassociated the transaction from the thread, the app would fail. If we tried to disassociate the transactiion we could encounter another OOM. There are several thousand of lines of code in BTM, and any given line, creating a string, comparing two objects, calling a method, could be hit with an OOM error and the effects on BTM are neither preventable nor recoverable. BTM was the unlucky victim in the case, but if it were the JDBC driver, log4j, or any other component, somebody else would be looking at a bug, not BTM. I don't mean to be rude or short, but it's like shooting the messenger. BTM was the unfortunate victim of a VM configured with too few resources. Even if we put a try catch in this piece of code, 1) it still could fail with another OOM, and 2) the odds of an OOM hitting that small section of code again instead of any of the thousands of others is almost zero. Here's an experiment. Configure your VM with 64M of memory, and see how well your application, or any other component in the system deals with the OOM that will result. It is not recoverable, or preventable, and trying to "fail gracefully" is just passing the OOM problem down the line.
        Hide
        Dennis Brakhane added a comment - - edited

        Once again, forget about the OOM. I do not expect any library to fail gracefully in this case, it just happened that I stumbled upon the BTM.begin() section of the code while analysing what happened.

        I just want to point out that an exception in that code block(for example, if setActive throws an exception, because the journal got an IOException)) might result in all further transaction calls from that unlucky thread failing with a confusing "transaction timed out"

        I'm thinking about something along the lines of

        currentTx = createTransaction();

        try

        { ... }

        catch (Exception e)

        { //notice that I do not catch Throwable removeTransactionFromContext(); throw e; }
        Show
        Dennis Brakhane added a comment - - edited Once again, forget about the OOM. I do not expect any library to fail gracefully in this case, it just happened that I stumbled upon the BTM.begin() section of the code while analysing what happened. I just want to point out that an exception in that code block(for example, if setActive throws an exception, because the journal got an IOException)) might result in all further transaction calls from that unlucky thread failing with a confusing "transaction timed out" I'm thinking about something along the lines of currentTx = createTransaction(); try { ... } catch (Exception e) { //notice that I do not catch Throwable removeTransactionFromContext(); throw e; }
        Hide
        Brett Wooldridge added a comment -

        Ok, if we're not talking about handling the OOM case but other exceptions, I see your point. The transaction timeout exception is confusing. Essentially, the begin() call needs to be completely atomic. Either it succeeds or fails cleanly. As it is now, if it fails it leaves a side-effect – that of having a (bogus) transaction context on the thread.

        What do you think Ludovic? I've attached a patch, which does this in begin():

        BitronixTransactionManager.java
        try {
            currentTx = createTransaction();
        
            currentTx.getSynchronizationScheduler().add(new ClearContextSynchronization(currentTx), Scheduler.ALWAYS_LAST_POSITION -1);
            currentTx.setActive();
            if (log.isDebugEnabled()) log.debug("begun new transaction at " + currentTx.getResourceManager().getGtrid().extractTimestamp());
        }
        catch (SystemException se) {
            clearCurrentContext();
            throw se;
        }
        catch (Exception e) {
            clearCurrentContext();
            throw new BitronixSystemException("exception during transaction begin", e);
        }
        
        Show
        Brett Wooldridge added a comment - Ok, if we're not talking about handling the OOM case but other exceptions, I see your point. The transaction timeout exception is confusing. Essentially, the begin() call needs to be completely atomic. Either it succeeds or fails cleanly. As it is now, if it fails it leaves a side-effect – that of having a (bogus) transaction context on the thread. What do you think Ludovic? I've attached a patch, which does this in begin(): BitronixTransactionManager.java try { currentTx = createTransaction(); currentTx.getSynchronizationScheduler().add( new ClearContextSynchronization(currentTx), Scheduler.ALWAYS_LAST_POSITION -1); currentTx.setActive(); if (log.isDebugEnabled()) log.debug( "begun new transaction at " + currentTx.getResourceManager().getGtrid().extractTimestamp()); } catch (SystemException se) { clearCurrentContext(); throw se; } catch (Exception e) { clearCurrentContext(); throw new BitronixSystemException( "exception during transaction begin" , e); }
        Hide
        Ludovic Orban added a comment -

        We have a common agreement, there is nothing we can do about OOME but any other manageable exception can and should be catched.

        @Brett: you got the idea right but there are some subtleties you missed, like canceling the TX timeout and the fact that clearCurrentContext leaves some information dangling as it is meant for the suspend case only.

        I've basically taken your patch, polished it and committed all that on the trunk. Have a look at revision #360 if you're interested in the gory details.

        Show
        Ludovic Orban added a comment - We have a common agreement, there is nothing we can do about OOME but any other manageable exception can and should be catched. @Brett: you got the idea right but there are some subtleties you missed, like canceling the TX timeout and the fact that clearCurrentContext leaves some information dangling as it is meant for the suspend case only. I've basically taken your patch, polished it and committed all that on the trunk. Have a look at revision #360 if you're interested in the gory details.

          People

          • Assignee:
            Ludovic Orban
            Reporter:
            Dennis Brakhane
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: