Description
Currently, the discovery mechanism is geared to walking a source repository and inserting into the target repository. However, there is no way to determine whether an artifact has been removed. We may need to track this, potentially using metadata or the repository index.
Brett Porter added a comment - 24/Jul/06 01:44 AM
It is better not to check the target at all, but to record all the information inside the repository being discovered. The best place for this is the index: when the indexer callback checks for deletions (which can be done using the listener technique described below), any deletions it finds should be recorded in the root metadata for other processes to pick up, handle and clear as they come by.
This does mean that it may operate like this:
1) file is deleted
2) converter discovers no change
3) indexer discovers change
4) converter finds change recorded for it, deletes from target
This delay should not be significant as long as the non-indexing discoverers regularly check for deletions.
This does require that we can enumerate all operations on a repository, so that each deletion can be recorded for every one of them.
We will also want to be able to test for deletions at a less frequent interval than regular discovery.
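The flow above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PendingDeletions`, `recordDeletion` and `takeFor` are hypothetical names, the record is held in memory rather than written to the root metadata, and the operation names are plain strings.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the deletion record kept in the root metadata. */
class PendingDeletions {
    // Every operation on the repository (e.g. "converter") that must act on a deletion.
    private final Set<String> operations;
    // Artifact path -> operations that have not yet handled its deletion.
    private final Map<String, Set<String>> pending = new LinkedHashMap<>();

    PendingDeletions(Set<String> operations) {
        this.operations = new LinkedHashSet<>(operations);
    }

    /** Called by the indexer when it finds that an artifact has gone missing. */
    void recordDeletion(String artifactPath) {
        // The deletion is added once for each known operation.
        pending.computeIfAbsent(artifactPath, p -> new LinkedHashSet<>(operations));
    }

    /** Called by an operation: returns the deletions it still has to handle and clears its mark. */
    Set<String> takeFor(String operation) {
        Set<String> result = new LinkedHashSet<>();
        Iterator<Map.Entry<String, Set<String>>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            if (entry.getValue().remove(operation)) {
                result.add(entry.getKey());
            }
            // Once every operation has handled the deletion, drop the record.
            if (entry.getValue().isEmpty()) {
                it.remove();
            }
        }
        return result;
    }

    boolean isEmpty() {
        return pending.isEmpty();
    }
}
```

In this sketch the converter would call `takeFor("converter")` on each run and delete the returned paths from the target; a record disappears only after every enumerated operation has picked it up.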
Other thoughts in case this doesn't pan out:
I can see two possible solutions:
1) on an alternate run, the caller passes in a list of things it knows about, and the ones that have gone missing are passed back for deletion from the caller (indexer, converter, etc)
2) metadata is written to each artifact directory
The first seems the most efficient to me, but has a high memory requirement if there are a lot of artifacts (and would require caching to avoid having to read the entire index/rediscovering the entire target repository/etc).
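One possible reading of the first option, as a minimal sketch: the caller hands over the relative paths it knows about, and the discoverer returns those that no longer exist on disk. The class and method names are hypothetical, and the assumption that each known artifact maps to one file under the repository root is mine, not part of the proposal.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks a caller-supplied list against the source repository. */
class MissingArtifactCheck {
    /**
     * Returns the entries of knownRelativePaths that no longer exist under repositoryRoot,
     * in the order they were supplied. The caller (indexer, converter, etc.) then deletes them.
     */
    static List<String> findMissing(Path repositoryRoot, List<String> knownRelativePaths) {
        List<String> missing = new ArrayList<>();
        for (String relative : knownRelativePaths) {
            if (!Files.exists(repositoryRoot.resolve(relative))) {
                missing.add(relative);
            }
        }
        return missing;
    }
}
```

The memory concern raised above shows up here as the `knownRelativePaths` list, which the caller has to build in full before the check can run.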
Once we convert to a listener, a better way would be for the listener to have two methods:
processModifiedArtifact() - for new or changed artifacts
processUnmodifiedArtifact() - generally do nothing, but notes its existence
At the completion of discovery, the caller can compare the list of discovered artifacts to its internal list and detect deletions. If it is easy to get a count of the target, then this will be very quick when nothing has changed. The most problematic case is the target repository on the converter, which will require a "reverse discovery" to find the old artifacts.
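The listener idea might look something like the sketch below. Only the two method names come from the proposal; the `String` path parameter, the `DiscoveryListener` and `DeletionTrackingListener` types, and `findDeleted` are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical listener interface; the path parameter is an assumption. */
interface DiscoveryListener {
    void processModifiedArtifact(String path);   // new or changed artifacts
    void processUnmodifiedArtifact(String path); // unchanged, but still present
}

/** Notes every artifact seen during discovery so deletions can be detected afterwards. */
class DeletionTrackingListener implements DiscoveryListener {
    private final Set<String> seen = new HashSet<>();
    private int modifiedCount = 0;

    @Override
    public void processModifiedArtifact(String path) {
        seen.add(path);
        modifiedCount++;
    }

    @Override
    public void processUnmodifiedArtifact(String path) {
        seen.add(path);
    }

    /** Call once discovery has completed, passing the caller's internal list. */
    Set<String> findDeleted(Set<String> previouslyKnown) {
        // Quick exit, assuming every unmodified artifact was already known to the caller:
        // with nothing new or changed and equal counts, the two sets are identical.
        if (modifiedCount == 0 && seen.size() == previouslyKnown.size()) {
            return Collections.emptySet();
        }
        Set<String> deleted = new HashSet<>(previouslyKnown);
        deleted.removeAll(seen);
        return deleted;
    }
}
```

The count comparison stands in for the "very quick when unchanged" case; it is only valid under the assumption stated in the comment, so a real implementation may prefer to always compute the difference.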
Maria Odea Ching added a comment - 15/Oct/07 06:20 AM
Fixed in -r584735
These were the changes made:
added code for cleaning up the database of artifacts that no longer exist in the repository
(DatabaseCleanupRemoveArtifactConsumer and DatabaseCleanupRemoveProjectConsumer)
created tests for database cleanup of removed artifacts
updated some of the test cases (in archiva-database and archiva-scheduled modules) to reflect the changes in the db cleanup consumers
The cleaning up of the index was not yet included here, as I suspect the locking problem (same as with the repository purge) will occur. I'll open a separate JIRA for this.
Thanks!