BTM
  1. BTM
  2. BTM-43

Fix for BTM-33 leads to another problem

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.2
    • Fix Version/s: 1.3.3
    • Labels:
      None
    • Environment:
      BTM trunk version, PostgreSQL JDBC
    • Testcase included:
      yes
    • Number of attachments :
      3

      Description

      This is a continuation of the bug reported on the mailing list
      http://www.nabble.com/Fix-for-BTM-33-leads-to-another-problem-td22285282.html

      I have attached the file to reproduce the problem and the corresponding LOG-Output.
      It was run against the trunk version (r268)

      I also have the debug logs from the live system, but I doubt that they are of value and therefore have omitted them. If they are really needed, I can provide them, although I would have to anonymize and shrink them quite a bit (they are 900M for 5 minutes worth of logs)

      To reproduce the issue, just run the attached program against a database (I used PostgreSQL, but it probably shouldn't matter) that has a table named "users".

      After a while (I noticed times from 5 seconds to 2 minutes), the program will halt with the following exception:
      java.sql.SQLException: error requeueing a JdbcPooledConnection from datasource aaa in state ACCESSIBLE wrapping org.postgresql.xa.PGXAConnection@f894ce
      at bitronix.tm.resource.jdbc.JdbcPooledConnection.release(JdbcPooledConnection.java:148)
      at bitronix.tm.resource.jdbc.JdbcConnectionHandle.close(JdbcConnectionHandle.java:78)
      at Btmtest.test(Btmtest.java:83)
      at Btmtest.main(Btmtest.java:43)
      Caused by: bitronix.tm.internal.BitronixSystemException: cannot close a resource when its XAResource is taking part in an unfinished global transaction
      at bitronix.tm.resource.common.TransactionContextHelper.requeue(TransactionContextHelper.java:117)
      at bitronix.tm.resource.jdbc.JdbcPooledConnection.release(JdbcPooledConnection.java:146)
      ... 3 more

      The following information was already mentioned in the mails, but I add them here for completeness:

      The bug DOES occur:

      • With r242
      • With r268
        It DOES NOT occur (to be more precise, the program was run for 5 minutes without problems):
      • With r241
      • When the main thread
      1. Btmtest.java
        3 kB
        Dennis Brakhane
      2. error-after-alleged-fix.tar.gz
        14 kB
        Dennis Brakhane
      3. use-this-log-file.log.gz
        12 kB
        Dennis Brakhane

        Issue Links

          Activity

          Hide
          Dennis Brakhane added a comment -

          Bad news.

          We retried today on 2 different machines, and we got exactly the same problems.

          It started with a corrupted log exception, intestingly again at position 21.
          We then shut the server down, moved the logs and started it up again.
          After a while, the log was again corrupted, at position 21.

          Some more facts:

          In both tests (yesterday and today), the corruption only occured on a 64 bit server, the 32 bit didn't show it. However, the 32-bit servers both have less memory available and therefore spend more time GCing as well as not having much memory free for file system caches etc, and secondly, they don't serve as many domains as the 64 bit ones do, so they don't have as much traffic and database connections.

          Some more random facts:

          • during the tests, the average number of postgresql locks increased, but this could also be a result of the bug and the accompianing transaction leak
          • BTM didn't log any exceptions and seems to work until the corruption; the other servers running 1.3.2 regulary log exceptions like "cannot switch state from IN_POOL to IN_POOL" and the exception this bug report was originally about

          Any thoughts about this? If not, I will try to rewrite the application to always use global transactions in the hope it will work well with 1.3.2

          I could provide the corrupted btm.tlog files if they would help (and if they don't store any sensitive information).

          Show
          Dennis Brakhane added a comment - Bad news. We retried today on 2 different machines, and we got exactly the same problems. It started with a corrupted log exception, intestingly again at position 21. We then shut the server down, moved the logs and started it up again. After a while, the log was again corrupted, at position 21. Some more facts: In both tests (yesterday and today), the corruption only occured on a 64 bit server, the 32 bit didn't show it. However, the 32-bit servers both have less memory available and therefore spend more time GCing as well as not having much memory free for file system caches etc, and secondly, they don't serve as many domains as the 64 bit ones do, so they don't have as much traffic and database connections. Some more random facts: during the tests, the average number of postgresql locks increased, but this could also be a result of the bug and the accompianing transaction leak BTM didn't log any exceptions and seems to work until the corruption; the other servers running 1.3.2 regulary log exceptions like "cannot switch state from IN_POOL to IN_POOL" and the exception this bug report was originally about Any thoughts about this? If not, I will try to rewrite the application to always use global transactions in the hope it will work well with 1.3.2 I could provide the corrupted btm.tlog files if they would help (and if they don't store any sensitive information).
          Hide
          Dennis Brakhane added a comment -

          I looked at the transaction log files myself, and they are indeed corrupted. What I have gathered so far is this:
          The corrupt entry is the first entry in btm1.log, the record type is 08h and the record size is given as 4Eh bytes, while in fact it is 7 bytes smaller. Interestingly, the broken entry contains only one "unique name", there are other entries which are of type 08h and 4Eh bytes long, these contain 2 unique names, the second one being 5 bytes long.

          (I have only looked at one of the corrupted logs, but as the other two also seem to be corrupt at pos 21, I assume that they have the same problem)

          Show
          Dennis Brakhane added a comment - I looked at the transaction log files myself, and they are indeed corrupted. What I have gathered so far is this: The corrupt entry is the first entry in btm1.log, the record type is 08h and the record size is given as 4Eh bytes, while in fact it is 7 bytes smaller. Interestingly, the broken entry contains only one "unique name", there are other entries which are of type 08h and 4Eh bytes long, these contain 2 unique names, the second one being 5 bytes long. (I have only looked at one of the corrupted logs, but as the other two also seem to be corrupt at pos 21, I assume that they have the same problem)
          Hide
          Dennis Brakhane added a comment -

          After further analysis, the crc32 in the corrupted entry is also only correct if you assume that there were originally 2 unique names. That is, it is incorrect for the actual record that was written.

          It seems like somehow, one entry in uniqueNames in TransactionRecord gets lost... Weird.

          Show
          Dennis Brakhane added a comment - After further analysis, the crc32 in the corrupted entry is also only correct if you assume that there were originally 2 unique names. That is, it is incorrect for the actual record that was written. It seems like somehow, one entry in uniqueNames in TransactionRecord gets lost... Weird.
          Hide
          Dennis Brakhane added a comment -

          Created a follow up bug report http://jira.codehaus.org/browse/BTM-44

          Show
          Dennis Brakhane added a comment - Created a follow up bug report http://jira.codehaus.org/browse/BTM-44
          Hide
          Ludovic Orban added a comment -

          Since you identified the root cause of the disk corruption and we can safely conclude that it is not related to this issue I'm now marking it as resolved.

          Show
          Ludovic Orban added a comment - Since you identified the root cause of the disk corruption and we can safely conclude that it is not related to this issue I'm now marking it as resolved.

            People

            • Assignee:
              Ludovic Orban
              Reporter:
              Dennis Brakhane
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: