Details
Description
I am trying to retrieve the record CounterVar (id=4, type=1) - see mapping file description below - in one transaction. In the same program, with a different transaction, I am trying to retrieve the record CounterVar (id=1, type=1) in the DbLocked mode. Note that in both cases the type field references the same record. This results in my second transaction getting LockNotGrantedException (writeLockTimeout). When I rollback both transactions and retry the retrieve (in the context of the same process), I observed that the call to QueryResults.size() returns 1, but the call to QueryResults.hasMore() returns false. On further investigation, I found that the call to hasMore() sees the following exception stack trace:
org.exolab.castor.jdo.ObjectNotFoundException: The object of type CounterVar with identity 1 was not found in persistent storage
at org.exolab.castor.persist.LockEngine.load(LockEngine.java:385)
at org.exolab.castor.persist.TransactionContext.load(TransactionContext.java:651)
at org.exolab.castor.persist.QueryResults.fetch(QueryResults.java:229)
at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.hasMore(OQLQueryImpl.java:633)
at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.hasMore(OQLQueryImpl.java:614)
at TestWriteLockTimeout.main(TestWriteLockTimeout.java:84)
In looking at the code, I see that LockEngine.load() is actually converting the exception ObjectDeletedWaitingForLockException into ObjectNotFoundException.
I am guessing that this is happening because of the previous (unsuccessful) retrieve.
-
- bug1247.tar
- 27/Oct/05 3:20 AM
- 20 kB
- Vishal Madan
-
- output.txt
- 07/Nov/05 5:01 AM
- 2 kB
- Swagata De Khan
-
- patch.C1247.20051112.bug-report.txt
- 13/Nov/05 9:55 AM
- 19 kB
- Werner Guttmann
-
- patch.C1247.20051112.txt
- 13/Nov/05 9:52 AM
- 3 kB
- Werner Guttmann
-
- TestWriteLockTimeout.java
- 07/Nov/05 5:00 AM
- 5 kB
- Swagata De Khan
Activity
Test case attached.
NOTE: On Windows when I execute this program, the second transaction hangs trying to retrieve the record (id=1). It comes back from the DB retrieve and then indefinately waits to acquire the write lock (which should actually time-out after 10s). I dont know why this happens. Is this some strange Windows quirk??
How do you set the timeout ? At the RDBMS level, or at Castor ?
We have set the DB timeout for write-locks to 300s. We don't set any timeout values in Castor, but I believe it defaults to 10s. Doesn't it? I have seen the 10s lock wait timeout in Castor in several of my tests.
Vishal, just out of curiosity, is this your problem a result of switching to 0.9.9 ? Iow, can you produce the same output with an older version of castor, e.g. 0.9.7 ?
Werner, I came across the issue in 0.9.5.3. I then tried it out with 0.9.9 and the same problem exists there too. I have not tried any version in between.
Vishal, just executing the test case on Windows Xp, and I am getting the LockNotGrantedExceptionn after 10s or so.
Vishal, it it on intention that you start a second transaction nested within the first one. Afair, Castor does not support this, and even if it does, I'd expect the LockNotGrantedException to be thrown as a result of loading the CounterType with id 1 as there's still a write-lock associated with this record in the database (as the first transaction has not been committed/rolled back yet).
Werner, Yes I have mentioned the quirky Windows behavior in one of the comments that I added to this issue. Seems like in Windows one does not get a write lock timeout exception at all. That is another thing that someone may want to look into (since we do expect to get the exception after the timeout has been reached).
And yes again, it is my intention to have a nested transaction retrieve the dependent record. And I completely do expect to see the Writelock timeout. My problem is with the ObjectNotFoundException. When the retry is done, I should be able to repeatedly get the WriteLock exception, which is not happening.
Oops... sorry about the Windows update. I thought you wrote that you could not see the error. Which version of JDK/JRE are you using?
Vishal, just to get a couple of things straight:
a) I do get a LockNotGrantedException upon loading the same object.
b) Apart from this exception, the test case completes successfully.
c) Iow, I do not get any other exceptions.
Now, I thought that test case were meant to show something else which you consider wrong. I assume it's me who does get something wrong, rigth ?
Werner,
In the 1st attempt we are getting the LockNotGrantedException from the 2nd transaction. So we are doing a rollback and retry.
In the 2nd attemp, we are not getting this exception. Instead, the hasmore() method call on the queryResults object obtained from the 2nd transaction is returning false and the program ends there immediately after this 2nd attempt.
We are expecting to get the same LockNotGrantedException from the 2nd transaction in all the 4 retries.
Thanks for explaining again ... and sorry for being so slow to pick this up. I thought that the JUnit test case would highlight this fact to me by e.g. using fail() to signal that a particular expectation has not been met. I will have a closer look later on ....
I have modified vishal's testcode a little to format the output. I am attaching the testcode(TestWriteLockTimeout.java) as well as the output.txt so that you can understand the situation better.
Werner, Any idea what is going wrong here? I've been trying to go through the code, but without any success (I kind of would expect that since the entire code is new to me - so it's taking me some time to understand how it works).
By the way, you didnt tell me what JDK version you are using on Windows. I'm using 1.4.1 and I see the problem with the retrieve call waiting indefinately and not retruning the LockWaitTimeout exception.
Btw, I still have not found any time to step through your problem. By the way it looks I will be able to do so tonight.
Thanks. A solution will really help. I have my mgmt breathing down my neck for this one.
I am not 100% sure, but I think that the problem is with releasing the locks properly after the LockNotGrantedxception has been thrown. Because of this, when trying to load the CounterCar instance again after having rolled back the first time, the load will fail as the ObjectLock instance stored with the transaction
has (still) the deleted flag set to true .. which apparently is wrong.
Werner, After spending some time with the code I traced the following behavior:
In the method LockEngine.load(...) the following things happen:
- typeInfo.acquire(...) is called which results in the locks being taken, or it throws the LockNotGranterException
- typeInfo.molder.load(...) is called which creates the object to be returned back to the application. This method will also recursively call LockEngine.load(...) for any dependent objects to be retrieved.
So, in the test program the first transaction is able to get the lock on both CounterVar (id=1) and CounterType (id=1) objects. Then when the next transaction is retrieving it is able to get the lock for the CounterVar(id=4) object i.e. the call to typeInfo.acquire(...) succeeds and then while in the call to typeInfo.molder.load(...) there is a call back into LockEngine.load(...) for loading the CounterType (id=1) object. Here the call to typeInfo.acquire(...) throws LockNotGrantedException. Since no lock is taken, in the finally block, the ObjectLock.confirm(...) method is not called. When the exception propagates back again into LockEngine.load(...) for the CounterVar(id=4) object as part of unwinding, a valid lock is present as a result of which ObjectLock.confirm(...) gets called in the finally block. This sets the _deleted flag in ObjectLock.
Now, I am assuming that this flag should normally be cleared up during the transaction rollback/commit, but since the retrieve was really never successful, the transaction has no reference to these objects. As a result the cleanup really never happens.
I tried out some changes for this:
In LockEngine.load(...), just before the finally block and after the catch block for ObjectDeletedWaitingForLockException I added a catch block for LockNotGrantedException as follows:
} catch ( ObjectDeletedWaitingForLockException except )
{ // This is equivalent to object does not exist throw new ObjectNotFoundException( Messages.format("persist.objectNotFound", oid.getName(), oid.getIdentity())); }catch ( LockNotGrantedException except )
{ if ( lock != null ) lock.confirm( tx, succeed ); if ( lock != null ) lock.release( tx ); lock = null; throw except; }finally
{ if ( lock != null ) lock.confirm( tx, succeed ); }This does seem to work with my test program. BUT, since i am not very familiar with the Castor code, I am not sure what the possible side effects of such a change would be.
Comments???
Werner, are you busy with something else? Please let me know if you can spend some time on this issue.
Vishal, I had a brief look at your suggested patch, and I think it looks okay. I still need to get the CTF to complete a successful run, but as I am going on a three day workshop right now, this won't be possible before Monday morning, I am afraid.
Werner, Thanks. I was just wondering about the long silence. I guess you are busy.
What is a CTF???
While testing different combinations of lock I came across a problem with the fix that I have proposed. In the attached test case if the access mode for the second retrieve is changed from DbLocked to ReadOnly I see the following exception:
java.lang.IllegalStateException: Internal error: Attempted to release/delete lock without first acquiring the lockCounterVar/4/2 by org.exolab.castor.jdo.engine.TransactionContextImpl@3bc257
at org.exolab.castor.persist.ObjectLock.release(ObjectLock.java:854)
at org.exolab.castor.persist.LockEngine.load(LockEngine.java:389)
at org.exolab.castor.persist.TransactionContext.load(TransactionContext.java:651)
at org.exolab.castor.persist.QueryResults.fetch(QueryResults.java:229)
at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.hasMore(OQLQueryImpl.java:631)
at org.exolab.castor.jdo.engine.OQLQueryImpl$OQLEnumeration.hasMore(OQLQueryImpl.java:614)
at TestReadLockTimeout.main(TestReadLockTimeout.java:74)
So, I made some changes to the fix I proposed earlier:
} catch ( ObjectDeletedWaitingForLockException except )
{ // This is equivalent to object does not exist throw new ObjectNotFoundException( Messages.format("persist.objectNotFound", oid.getName(), oid.getIdentity())); }catch ( LockNotGrantedException except )
{ if ( lock != null ) lock.release( tx ); throw except; }finally
{ if ( lock != null ) lock.confirm( tx, succeed ); }
So, now in the newly added catch block for LockNotGrantedException I just call release(...). Then confirm(...) will get called in the finally block.
For some reason the change in order of the call seems to solve the problem. I am not quiet sure I understand why!! I'm digging deeper into what is going on over here. Comments??
Vishal, just trying to run your patch against the CTF now. CTF, btw, stands for the Castor Test Framework, and is a set of functional (JUnit) tests to verify that Castor is (still) working ,,, to run the complete test suite against e.g mysql, first execute > build jar.jdotests and second >test castor.mysql.
All 53 tests complete succesfullyy after applying this patch. I'd like to see the test case provided by you to be integrated with the CTF after commiting this patch. Still need to make up my mind as to where I'd like to see this added.
Final patch, incl. commit preparations. As already reported, all 53 CTF tests pass successfuly.
Werner, Would you like me to write the JUnit tests for the issue? I modified the test that I had attached to use multiple threads rather than nested transactions for my own testing. I could convert that into a JUnit test and attach it.
Thanks for the offer, but I think I'll do this myself, as I need to make up my mind where exactly to include this within the CTF.
Vishal, can I come back to your offer to supply me with a JUnit test case (based upon your re-write) ?
Sure. I am a little swamped right now, but I should be able to write it for you in the next couple of days.
For whatever reason I am unable to figure out how to attach something to this report. I know it is possible because I've done it before!! Werner.... HELP! Or I could just e-mail it to you?
Vishal, as this issues is closed you can not attach new files to it. If you found that the original issue has not been resolved it should be reopened which would allow you to attach new files thereafter. If you found something that relates to the original issue or you have created a test case I suggest to create a new issue and link it to this one as related.
I will add the test case in some time.