Help! My RTC Database is getting big!

Many customers who have been using RTC for some time are seeing their database size grow, in some cases, to 100s of GBs if not TBs of data.  While growth is to be expected and is of no issue to modern DBMSes, proactive administrators want to understand the growth and how it can be mitigated especially in light of anticipated user growth in their environments.  Naturally, they come to us and ask what can be done.  While at this time we don’t have the solutions many customers are asking for (project copy, move, archive, delete, etc.), that isn’t to say we don’t have approaches that may be of value in some situations.

Working with our own self-hosted RTC environment as well as those of our enterprise customers, we generally find that the largest contributors to database size our build results, work item attachments and versioned content.  How would you know that?  Fortunately, we have a couple of out of the box reports you can run: Latest Repository Metrics and Latest Repository Metrics by Namespace.  Below are some samples showing a subset of the available namespaces and item types.

image

image

Looking at all the namespaces and item types begs the question…what do they all mean?  Yeah, they aren’t all obvious to me either.  Luckily, I have access  to the smart developers who wrote this stuff and can tell me.  If you find one you don’t know, send me a note/comment or post it on the jazz.net forums.  image

Once you find the larger contributors to size, the next questions asked are can they be deleted and who (that is, which project) is producing them.  In keeping with my team’s No Nonsense Tech Talk theme, I’ll be honest, there’s not much we can delete/archive and we certainly can’t do it at a project level, which would be of greater value, nor can we easily tell who produced it all.  It’s not all doom and gloom because there are some things we can do.

As mentioned earlier, we can delete build results, which often are a huge contributor to size growth.  We can delete work items, even attachments from work items.  Versioned content can be deleted, though you don’t usually want to do that, except for security reasons or to remove binaries versioned by mistake.  Then there are plans, streams, workspaces, etc, that can be deleted, but these don’t tend to take up much space.

So what happens when something is deleted? Well, in some cases, it’s not really removed from the database, only reference to it is removed or made less accessible.  For example, work item attachments don’t really go away when removed from a work item.  Try this.  Add an attachment to a work item then save the work item.  Hover over the attachment link and copy it.  Now remove the attachment from the work item then save the work item.  In the browser, paste the copied attachment URL and it will be found.  Similarly, if you delete a work item that has attachments, the attachment links still remain valid.  However, if you delete (not remove) a work item from the Eclipse client, the work item is actually deleted.

If you find that you’ve removed but not deleted an attachment, it is possible to go back and have it truly deleted.  To do so, using the Eclipse client, paste the URL to the attachment (which should be visible in a discussion comment from when it was first attached) somewhere into the work item (into a comment or the description), right click over that link and select “add to favorites”. Once it is in the favorites, you can drag it from Favorites and drop onto the Attachments section, which re-attaches it to the work item, at which point you can then delete it in the normal way.

Now some things like build results and versioned content once deleted can truly be deleted from the database.

At the repository level there are two key background tasks that are used to mark content as deletable then later deletes it.

  • an “Item Cleanup Task” background task at repository level marks newly orphaned content blobs as needing deletion (runs every 17 mins by default)
  • a “Deleted Items Scrub Task” background task at repository level deletes any content blobs that have been orphaned for more than ~2 hours (runs every 24 hours by default)

Once these both run, any content blobs that were deleted more than 2 hours ago should be fully deleted from the database.

However, DBMSes (particularly those in production) don’t generally release storage allocated for their tables immediately.  A compaction task usually needs to be run to reclaim the disk space.  The DBMS should have tools to indicate in advance how much space can be reclaimed by compaction.  Typical utilities to do this are shown below.  Details for using them should be left to a qualified DBA.

  • Oracle – ALTER TABLE … SHRINK SPACE
  • DB2 – REORG
  • SQL Server – SHRINKDATABASE

My teammate Takehiko Amano has done a very nice job of showing how deleting versioned content and later running DB2 commands reduces database size.  See his article Reducing the size of the Rational Team Concert repository database.

We find the build results often take up a good amount of the RTC database size.  These results often include all the log files from compilations, tests and other activities performed during the build.  Some times they will contain downloadable outputs, e.g. application archives and executables.  What happens is these results are often kept around and never deleted.  In some cases, especially for release milestones, they should be, but all those personal or interim continuous builds don’t need to be kept.  Build results can be deleted and will result in their database content being orphaned and subsequently deleted per the aforementioned process.  Rather than manually deleting results, consider setting up a pruning policy to automatically delete old results.  For those results you want to keep around and not be pruned, just mark them as not deletable.

In cases you know your build results are taking up a lot of space, the natural follow on question is which builds and who owns them.  Our development team recently had cause to address that question which resulted in a very useful script written by Nick Edgar, the RTC Build Component Lead.

Nick created a Groovy script that collects into a CSV file the pruning policy settings and footprint data for each build in all projects across a repository.

SNAGHTML2fbd67b2

He further creates an interactive HTML report that parses the CSV file for display in a more visual form.  image

With this information you can find out which builds from which projects are taking up the most space and whether they have a pruning policy in place.  Armed with this information, an administrator could go to the appropriate release teams and take action.  Imagine running it weekly and posting to a dashboard or emailing it weekly to release teams.  The Groovy script to collect the data and index.html to render the report are attached to task work item 330478.

For gathering the CSV data you’ll need to 1) install Groovy, 2) install a build system toolkit that’s compatible (ideally at same version) as RTC server, 3) set environment variables (see top of Groovy script), and 4) run the script with: groovy -cp “$BUILD_TOOLKIT/*” <groovy file name> <any arguments needed by script>.

For the chart, just put the chart index.html file and CSV in the same directory and open the HTML file. Some browsers will require these to be served up by a web server to allow the HTML file to read the CSV file. For my testing, I used Python’s simple server support for this: python -m SimpleHTTPServer.

Given I am referencing code samples, I’ll keep our lawyers happy by stating that any code referenced is derived from examples on Jazz.net as well as the RTC SDK. The usage of such code is governed by this license. Please also remember, as stated in the disclaimer, that this code comes with the usual lack of promise or guarantee. Enjoy!

Being able to monitor the size and growth of your data, getting granular and actionable information about it and ultimately, being able to do something positive about that growth, is a key concern for IBM Rational and something we are exploring as part of our Platinum Initiative.  I welcome your input in this area.  Perhaps we can interact at IBM InterConnect 2015.

Automating changes to a build workspace configuration

Here’s the scenario.  A build engineer gets a request to establish a build workspace and corresponding build workspace.  He creates the build workspace, gets the desired components added to it, sets the flow targets then changes ownership of the workspace to the build user’s functional ID.  After that, he creates the build definition and associates it with the newly created workspace.  All is good and developers can begin to use the new build definition.  After a time, a request comes in to change the configuration of the build workspace, specifically the components to use.  Since the build workspace is owned by the build user, any changes to it must be made by that user.  Many organizations eschew the use of functional IDs or at least minimize who knows their credentials and are concerned about extra maintenance brought on by password expiration rules.  What to do?

This specific scenario came up recently with a customer of mine.  In particular, their build workspaces have three flow targets.  The current/default is the stream with the application source and it is scoped to include a subset of the components in the stream.  The other two streams are build script and configuration related.  At times, the release team needs to change the components included in the scope of the application source stream.  They do so currently by logging in as the build user, something they detest doing.

What they much prefer is support for team ownership of repository workspaces but that isn’t currently possible (though requested via 271760).  We instead proposed a solution that put the workspace configuration change in the build script which is already performed as the build user and owner of the workspace.

As of 4.0.1, the SCM CLI includes the capability to add/change workspace flow targets.  It was later refactored in 4.0.6 to the current verb-oriented form.

${scmExe} –non-interactive set flowtarget -r ${repositoryAddress} -u ${userId} -P ${password} ${team.scm.workspaceUUID} ${targetStreamUUID} –flow-components ${componentsToLoad}

Where:

  • scmExe – path to scm CLI executable
  • repositoryAddress – URL of CCM server
  • userId – build user ID
  • password – password of build user
  • team.scm.workspaceUUID – build repository workspace UUID
  • targetStreamUUID – stream flow target to set component scope
  • componentsToLoad – space delimited list of components

Unfortunately the SCM CLI does not understand the password file format used by the Ant tasks. You either need to give it as plain text using –P shown above or login outside the build system with the option to remain logged in (on each build machine).

A simple way to get the UUID of a workspace or stream is to open it up in the Eclipse editor and select Copy URL workspace editor menu in the view header or browse to it in the Web UI to get the URL.  The end portion of the URL is the UUID.  For example, the UUID for the repository workspace URL shown below is _GVqXYLRpEeOdavKqgVc36Q

https://clm.jkebanking.net:9443/ccm/resource/itemOid/com.ibm.team.scm.Workspace/_GVqXYLRpEeOdavKqgVc36Q

For components with spaces in their name, care must be taken to offset them by appropriate quotes.  For my tests, single quotes worked on Linux and double on Windows.

The command will set the scope of the specified flow target (e.g. application development stream) for the specified workspace (e.g. build workspace) to include only those components in the specified list.  The current configuration of the component(s) in the target stream is used.

Note if multiple flow targets exist for the workspace and a component listed is included in multiple flow targets, then scoped flows need to be used to avoid conflicts.  See How should my source code be loaded from Jazz SCM?

Let’s take a look at an example.  Below, the build.brm.continuous repository workspace, owned by the build user, has six components and three flow targets in its configuration. Banking Logic, Database, Java UI and Prerequisites are from the BRM StreamBuild comes from Build Scripts and Build Config comes from the stream of the same name.

image

The BRM Stream is scoped to only include only a subset of the components.

SNAGHTML9b0895d

Assume that we wanted to add the Database component to the scope.  Using SCM CLI command similar to that shown in the screenshot below, the scope can be changed to add it in.

SNAGHTML9bded5d

This results in the flow target scope being changed.

SNAGHTML9beca53

Now to add this to the build script so it can be automated.  Observe in the editor screenshot below that the command has been added to an exec statement in an Ant build script.  image

The targetStreamUUID and componentsToLoad values need to be passed to the build script.  Add these as properties of the build definitions.  For example:

image

When the build is requested, the componentsToLoad value can be changed.  In our original example, the Database component can be added in by editing the componentsToLoad build property at the time of the build request.

SNAGHTML9cfe091

Should you be concerned that adding the ‘set flowtarget’ command to the build script will add unnecessary overhead, albeit very minimal, to every build execution even when component list isn’t changed, you can create a build script and definition that only performs the ‘set flowtarget’ command and run it when needed.

The example shown was for changing the flow target component scope.  The technique used can be applied for other build workspace manipulations needed and supported by the SCM CLI.

Thanks to Nick Edgar, RTC Build component lead, for making me aware of the ‘set flowtarget’ command and suggesting its application to the problem described.