Many customers who have been using RTC for some time are seeing their database size grow, in some cases, to 100s of GBs if not TBs of data. While growth is to be expected and is of no issue to modern DBMSes, proactive administrators want to understand the growth and how it can be mitigated especially in light of anticipated user growth in their environments. Naturally, they come to us and ask what can be done. While at this time we don’t have the solutions many customers are asking for (project copy, move, archive, delete, etc.), that isn’t to say we don’t have approaches that may be of value in some situations.
Working with our own self-hosted RTC environment as well as those of our enterprise customers, we generally find that the largest contributors to database size our build results, work item attachments and versioned content. How would you know that? Fortunately, we have a couple of out of the box reports you can run: Latest Repository Metrics and Latest Repository Metrics by Namespace. Below are some samples showing a subset of the available namespaces and item types.
Looking at all the namespaces and item types begs the question…what do they all mean? Yeah, they aren’t all obvious to me either. Luckily, I have access to the smart developers who wrote this stuff and can tell me. If you find one you don’t know, send me a note/comment or post it on the jazz.net forums.
Once you find the larger contributors to size, the next questions asked are can they be deleted and who (that is, which project) is producing them. In keeping with my team’s No Nonsense Tech Talk theme, I’ll be honest, there’s not much we can delete/archive and we certainly can’t do it at a project level, which would be of greater value, nor can we easily tell who produced it all. It’s not all doom and gloom because there are some things we can do.
As mentioned earlier, we can delete build results, which often are a huge contributor to size growth. We can delete work items, even attachments from work items. Versioned content can be deleted, though you don’t usually want to do that, except for security reasons or to remove binaries versioned by mistake. Then there are plans, streams, workspaces, etc, that can be deleted, but these don’t tend to take up much space.
So what happens when something is deleted? Well, in some cases, it’s not really removed from the database, only reference to it is removed or made less accessible. For example, work item attachments don’t really go away when removed from a work item. Try this. Add an attachment to a work item then save the work item. Hover over the attachment link and copy it. Now remove the attachment from the work item then save the work item. In the browser, paste the copied attachment URL and it will be found. Similarly, if you delete a work item that has attachments, the attachment links still remain valid. However, if you delete (not remove) a work item from the Eclipse client, the work item is actually deleted.
If you find that you’ve removed but not deleted an attachment, it is possible to go back and have it truly deleted. To do so, using the Eclipse client, paste the URL to the attachment (which should be visible in a discussion comment from when it was first attached) somewhere into the work item (into a comment or the description), right click over that link and select “add to favorites”. Once it is in the favorites, you can drag it from Favorites and drop onto the Attachments section, which re-attaches it to the work item, at which point you can then delete it in the normal way.
Now some things like build results and versioned content once deleted can truly be deleted from the database.
At the repository level there are two key background tasks that are used to mark content as deletable then later deletes it.
- an “Item Cleanup Task” background task at repository level marks newly orphaned content blobs as needing deletion (runs every 17 mins by default)
- a “Deleted Items Scrub Task” background task at repository level deletes any content blobs that have been orphaned for more than ~2 hours (runs every 24 hours by default)
Once these both run, any content blobs that were deleted more than 2 hours ago should be fully deleted from the database.
However, DBMSes (particularly those in production) don’t generally release storage allocated for their tables immediately. A compaction task usually needs to be run to reclaim the disk space. The DBMS should have tools to indicate in advance how much space can be reclaimed by compaction. Typical utilities to do this are shown below. Details for using them should be left to a qualified DBA.
- Oracle – ALTER TABLE … SHRINK SPACE
- DB2 – REORG
- SQL Server – SHRINKDATABASE
My teammate Takehiko Amano has done a very nice job of showing how deleting versioned content and later running DB2 commands reduces database size. See his article Reducing the size of the Rational Team Concert repository database.
We find the build results often take up a good amount of the RTC database size. These results often include all the log files from compilations, tests and other activities performed during the build. Some times they will contain downloadable outputs, e.g. application archives and executables. What happens is these results are often kept around and never deleted. In some cases, especially for release milestones, they should be, but all those personal or interim continuous builds don’t need to be kept. Build results can be deleted and will result in their database content being orphaned and subsequently deleted per the aforementioned process. Rather than manually deleting results, consider setting up a pruning policy to automatically delete old results. For those results you want to keep around and not be pruned, just mark them as not deletable.
In cases you know your build results are taking up a lot of space, the natural follow on question is which builds and who owns them. Our development team recently had cause to address that question which resulted in a very useful script written by Nick Edgar, the RTC Build Component Lead.
Nick created a Groovy script that collects into a CSV file the pruning policy settings and footprint data for each build in all projects across a repository.
He further creates an interactive HTML report that parses the CSV file for display in a more visual form.
With this information you can find out which builds from which projects are taking up the most space and whether they have a pruning policy in place. Armed with this information, an administrator could go to the appropriate release teams and take action. Imagine running it weekly and posting to a dashboard or emailing it weekly to release teams. The Groovy script to collect the data and index.html to render the report are attached to task work item 330478.
For gathering the CSV data you’ll need to 1) install Groovy, 2) install a build system toolkit that’s compatible (ideally at same version) as RTC server, 3) set environment variables (see top of Groovy script), and 4) run the script with: groovy -cp “$BUILD_TOOLKIT/*” <groovy file name> <any arguments needed by script>.
For the chart, just put the chart index.html file and CSV in the same directory and open the HTML file. Some browsers will require these to be served up by a web server to allow the HTML file to read the CSV file. For my testing, I used Python’s simple server support for this: python -m SimpleHTTPServer.
Given I am referencing code samples, I’ll keep our lawyers happy by stating that any code referenced is derived from examples on Jazz.net as well as the RTC SDK. The usage of such code is governed by this license. Please also remember, as stated in the disclaimer, that this code comes with the usual lack of promise or guarantee. Enjoy!
Being able to monitor the size and growth of your data, getting granular and actionable information about it and ultimately, being able to do something positive about that growth, is a key concern for IBM Rational and something we are exploring as part of our Platinum Initiative. I welcome your input in this area. Perhaps we can interact at IBM InterConnect 2015.