Version pruning with Powershell and the FileHold API

FileHold is greedy. It keeps everything (maybe it is a hoarder). Whether it is the log of everything that happens to a document or the contents of the document itself, FileHold remembers. Since FileHold allows you to store any type of file, it makes no assumptions about the best way to know one version from the next; it keeps both complete files.

In the world of near unlimited low-cost storage, extra copies of files are often not an issue. The impact of extra copies can grow when you think about onsite backups, offsite backups, fast storage, slow storage, etc. In some cases, these versions are of no importance once a final version is complete and there may be a very large number of them. This results in requests to delete unneeded versions or version pruning. There are probably as many criteria for version pruning as people that ask me what features FileHold has for version pruning.

The good news is that there is a simple method to remove old versions. In the standard FileHold clients you can simply open the version history, select the versions you no longer want and delete. Of course, these deleted versions will go to the soft delete queue to await their permanent removal. Since it is available in the client, you know it is available in the API and some Powershell code can easily add some automation for those cases where you need to tackle many documents at once.

Time to warn you about the risks of making mass updates and using the API in general. Make sure you have backups, test your scripts on your test system, blah, blah and, if you are not here for light reading, it pays to have a bit of SQL and Powershell skill.

If you have not read my Powershell blog articles before you may want to have a quick look back as I cover the basics of authenticating to FileHold and forming web service calls using Powershell there. At the end of this article I have included the complete code discussed here and the supporting code.

There are two parts to this version pruning solution: the brains and the brawn. We will build the brains in SQL and the brawn in Powershell with some FileHold API calls. You could build the whole thing in Powershell, but SQL is so good and fast at finding very specific things, that makes more sense. It does mean that when you want to do your version pruning you will need access to SQL Server and to the FileHold server from where ever this code will run. We hope to be able to do something like the following:

Invoke-SqlCmd -Database ch_librarymanager -Query $myDocumentVersionQuery | Remove-DocumentVersions

This command assumes we are using Integration Windows Authentication and the query to select the correct document versions is in $myDocumentVersionQuery.

For the brawn side of things, we are expecting a list from the Powershell pipeline of the internal document version ids from the SQL query. As in past blog articles, we will create the Begin, Process, and End sections of our Powershell script to collect the version ids and then call the delete method in the API.

Aside from logging into FileHold we will initialize our list of document versions that will be deleted. We are going to keep a list of document versions as we go and delete them at the end of the script as it is more efficient.

$documentVersionIdsToDelete = New-Object -TypeName System.Collections.ArrayList -ArgumentList ( $null )

Our process section will also be simple as we will just add the latest id to our list.

foreach ( $documentVersionId in $ValueList ) {
   $documentVersionIdsToDelete.Add( [Int32]$documentVersionId ) }

The meat is in the end section. On the surface it seems trivial as the API provides the DeleteDocuments method in the LibraryManager.DocumentManager service. The catch is that DeleteDocuments takes a Selection as its only parameter.

Creating a Selection is a two-part exercise as first you must have Snapshot created by a search using GetDocumentsBySnapshot. The good news is that we have a list of document version ids as our input and that is one of the supported search types.

$snapShotId = [GUID]::Empty
$searchCriteria = New-Object (LibraryManager.SearchCriteria)
[System.Collections.ArrayList]$conditions = @()
$i = $conditions.Add( (New-Object (LibraryManager.SearchCondition)) )
    $conditions[$i].SearchType = [DocumentFinder.SearchType]::DocumentVersionId.value__
    $conditions[$i].OperatorType = [DocumentFinder.Operator]::InList.value__
    $conditions[$i].Operands = $documentVersionIdsToDelete
$searchCriteria.SearchConditions = $conditions
$searchResults = (DocumentFinder).GetDocumentsBySnapshot( [GUID]::Empty, [ref]$snapshotId, 'SRT', $searchCriteria, $null, 0, 100 )

In this example, we are limiting the number of search results we have available to 100. You can increase this number, but you need to have enough RAM for the data that will be returned. Increasing it beyond the maximum size of search results will not return more values. If you are deleting a large number of document versions, you may want to plan to iterate across some reasonable number until all desired versions have been deleted.

We now have our snapshot, but we need to get to a selection. In the user interface a user may pick one or more items from the search results, but we will simply pick all. Creating a selection requires arrays of the document ids and metadata version ids, but we can easily pull them from the search results with a little Powershell magic.

$snapshotSelection = New-Object (LibraryManager.SnapshotSelection)
$snapshotSelection.SnapshotId = $snapshotId
$snapshotSelection.DocumentIdList = @( $searchResults.DocumentValues.ForEach( { $_.DocumentId } ))
$snapshotSelection.MetadataVersionidList = @( $searchResults.DocumentValues.ForEach( { $_.MetadataVersionId } ))
$selection = New-Object (LibraryManager.Selection)
$selection.SnapshotSelection = @( $snapshotSelection )
$selectedDocumentVersions = (DocumentManager).CreateSelection( $selection )

Finally, we make the big call.

(DocumentManager).DeleteDocuments( $selectedDocumentVersions, $false )

With the brawn out of the way, we can work on the brains of this operation. It pays to know a bit about some key tables and columns in the library manager database. Each record in DocumentVersions matches an actual version in FileHold. MetadataVersions is a universal link between document properties tables and document versions. Every time properties get updated, a new metadata versions record is created. For our example we are also going to need the DocumentSchemas table so we can use human readable names for our where clause.

In our example we are going to prune versions where the latest version of the document has been approved in a workflow. That means we need a list of records that we will use to select the real list of records. I smell a nested SELECT.

What we want is document ids for all the documents that are candidates for version pruning. The latest document version must be approved and it should be from the schemas we are interested in.

SELECT dv.[DocumentId]
FROM [DocumentVersions] dv
    INNER JOIN MetadataVersions mv ON mv.DocumentVersionId = dv.DocumentVersionId
    INNER JOIN DocumentSchemas ds ON ds.DocumentSchemaId = mv.DocumentSchemaId
WHERE ApprovalStatus = 2
    AND dv.islast = 1 -- only compare with the latest version
    AND dv.isdeleted = 0 -- ignore versions that have been deleted
    AND ds.Name IN  ( 'Schema-1', 'Schema-2' )

As a side note, magic numbers like the approval status can be found in the API reference guide. Although the guide does not refer directly to the database, the enumerations translate exactly to the database values. If you search the API reference index for “ApprovalStatus”, you will find the enumeration and the approved value which is 2.

The main query will be a little simpler. For all the documents we found that were approved for Schema-1 and 2, we will find all their previous versions that are not already deleted. To keep the example simple, we will select only 100 records to match our page size in the snapshot selection.

SELECT TOP 100 DocumentVersionId
FROM DocumentVersions dvMain
WHERE dvMain.DocumentId IN ( … the select statement above goes here … )
    AND dvMain.IsLast = 0
    AND dvMain.IsDeleted = 0

And, that is that. We can simply assign that query to the $myDocumentVersionQuery variable in the example at the top.

As always, if you do not want to tackle something like this on your own, our professional services team is at your disposal.



Russ Beinder

Russ Beinder is the Chief Technology Officer at FileHold. He is an entrepreneur, a seasoned business analyst, computer technologist and a certified Project Management Professional (PMP). For over 35 years he has used computer technology to help organizations solve business problems.