Getting to a right-sized Engineering Lifecycle Management environment


The following is a refresh of a previous post from 2016.

You’ve just made the decision to adopt one of the Jazz solutions from IBM.  Of course, being the conscientious and proactive IT professional that you are, you want to ensure that you deploy the solution to an environment that is performant and scalable.  Undoubtedly you begin scouring the IBM Knowledge Center and the latest System Requirements.  You’ll find some help and guidance on Deployment and installation planning and even a reference to advanced information on the Deployment wiki.  Unlike the incongruous electric vehicle charging station in a no parking zone, you are looking for definitive guidance but come away scratching your head still unsure of how many servers are needed and how big they should be.

This is a question I am often asked, especially lately.  I’ve been advising customers in this regard for several years now and thought it would be good to start capturing some of my thoughts.  As much as we’d like it to be a cut-and-dried process, it’s not.  This is an art, not a science.

My aim here is to capture my thought process and some of the questions I ask and references I use to arrive at a recommendation.  Additionally, I’ll add in some useful tips and best practices.

I find that the topology and sizing recommendations are similar regardless of whether the server is to be physical or virtual, on-prem or in the cloud, managed or otherwise.  These choices impact other aspects of your deployment architecture to be sure, but generally not the number of servers to include in your deployment or their size. One exception is that managed cloud environments often start lower than the recommended target, since those managing them understand how to monitor the environment, look for indicators that more resources are needed and can quickly respond to increasing demands.


From the outset, let me say that no matter what recommendation I or one of my colleagues gives you, it’s only a point in time recommendation based on the limited information given, the fidelity of which will increase over time.  You must monitor your Jazz solution environment.  In this way you can watch for trends to know when a given server is at capacity and needs to scale by increasing system resources, changing the distribution of applications in the topology and/or adding a new server.  See Deployment Monitoring for some initial guidance.  Since 6.0.3, we have added capabilities to monitor Jazz applications using JMX MBeans. Enterprise monitoring is a critical practice to include in your deployment strategy.


Before we even talk about how many servers and their size, the other standard recommendation is to ensure you have a strategy for keeping the Public URI stable which maximizes your flexibility in changing your topology.  We’ve also spent a lot of time deriving standard topologies based on our knowledge of the solution, functional and performance testing, and our experience with customers.  Those topologies show a range in number of servers included.  The departmental topology is useful for a small proof of concept or sandbox environment for developing your processes and procedures and required configuration and customization.  For most production environments, a distributed enterprise topology is needed.

The tricky part is that the enterprise topology specifies a minimum of 8 servers to host just the Jazz-based applications, not counting the Reverse Proxy Server, Database Server, License Server, Directory Server or any of the servers required for non-Jazz applications (IBM or 3rd Party).  For ‘large’ deployments of 1000 users or more that seems reasonable.  What about smaller deployments of 100, 200, 300, etc. users?  Clearly 8+ servers is overkill and will be a deterrent to standing up an environment.  This is where some of the ‘art’ comes in.  I find that, more often than not, I am recommending a topology that is somewhere between the departmental and enterprise topologies.  In some cases, a federated topology is needed when a deployment has separate and independent Jazz instances but needs to provide a common view from a reporting perspective and/or for global configurations, as in the case of a product line strategy. The driving need for separate instances could be isolation, sizing, reduced exposure to failures, organizational boundaries, merger/acquisition, customer/supplier separation, etc.

The other part of the ‘art’ is recommending the sizing for a given server.  Here I make extensive use of all the performance testing that has been done, including the following.


The CLM Sizing Strategy provides a comfortable range of concurrent users that a given Jazz application can support on a given sized server for a given workload.  Should your range of users be higher or lower, your server be bigger or smaller or your workload be more or less demanding, then you can expect your range to be different or to need a different sizing.  In other words, judge your sizing or expected range of users up or down based on how closely you match the test environment and workload used to produce the CLM Sizing Strategy.  Concurrent use can come from direct use by the Jazz users but also 3rd party integrations as well as build systems and scripts.  All such usage drives load so be sure to factor that into the sizing.  There are other factors such as isolating one group of users and projects from another, that would motivate you to have separate servers even if all those users could be supported on a single server.

Should your expected number of concurrent users be beyond the range for a given application, you’ll likely need an additional application server of that type.  For example, the CLM Sizing Strategy indicates a comfortable range of 400-600 concurrent users on a CCM (Engineering Workflow Management) server if just being used for work items (tracking and planning functions).  If you expect to have 900 concurrent users, it’s a reasonable assumption that you’ll need two CCM servers.  Scaling a Jazz application to support higher loads involves adding an additional server, which the Jazz architecture easily supports through multi-server or clustering topology patterns.  Be aware though that there are some behavioral differences and limitations when working with the multi-server (not clustered) pattern.  See Planning for multiple Jazz application server instances and its related topic links to get a sense of considerations to be aware of up front as you define your topology and supporting usage models.  As of 6.0.6.1, application clustering is only available with the CCM application.
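The arithmetic behind the 900-user example can be sketched as a ceiling division. This is only an illustration: the function name `servers_needed` and the choice of which end of the comfortable range to size against are my assumptions, not part of the CLM Sizing Strategy.

```python
import math


def servers_needed(expected_concurrent_users: int, comfortable_users_per_server: int) -> int:
    """Smallest number of servers that keeps each one at or below the
    comfortable concurrent-user figure (hypothetical helper for illustration)."""
    if expected_concurrent_users < 0 or comfortable_users_per_server <= 0:
        raise ValueError("user counts must be non-negative and capacity positive")
    # Even a very small deployment still needs one server of this type.
    return max(1, math.ceil(expected_concurrent_users / comfortable_users_per_server))


# Sizing 900 users against the upper end of the 400-600 range from the text.
upper_bound_estimate = servers_needed(900, 600)
# Sizing the same load against the lower end gives a more conservative figure.
lower_bound_estimate = servers_needed(900, 400)
```

Sizing against the upper end of the range gives the two CCM servers mentioned above. Sizing against the lower end gives three, which may be the safer starting point if your workload is heavier than the tested one.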

This post doesn’t address other servers likely needed in your topology such as a Reverse Proxy, Jazz Authorization Server (which can be clustered), Content Caching Proxy and License Key Server Administration and Reporting tool.  Be sure to read up on those so you understand when/how they should be incorporated into your topology.  Additionally, many of the performance and sizing references I listed earlier include recommendations for various JVM and application settings.  Review those and others included in the complete set of Performance Datasheets and Sizing Guidelines.  It is critical not only to get the server sizing right but also to tune the JVM and application properly.

To get to the crux of the primary question of number of servers and their size, I ask a number of questions.  Here’s a quick checklist of them.

  1. What Jazz applications are you deploying?
  2. What other IBM or 3rd party tools are you integrating with your Jazz applications?
  3. How many total and concurrent users by role and geography are you targeting and expect to have initially?  What is the projected adoption rate?
  4. What is the average latency from each of the remote locations?
  5. How much data (number of artifacts by domain) are you migrating into the environment? What is the projected growth rate?
  6. If adopting IBM Engineering Workflow Management, which capabilities will you be using (tracking and planning, SCM, build)?
  7. What is your build strategy? frequency/volume?
  8. Do you have any hard boundaries needed between groups of users, e.g. organizational, customer/supplier, etc. such that these groups should be separated onto distinct servers?
  9. Do you anticipate adopting the global or local configuration management capability?
  10. Do you make significant use of resource-intensive scenarios?
  11. What are your reporting needs? document generation vs. ad hoc? frequency? volume/size?

Most of these questions allow me to get a sense of what applications are needed and what could contribute to load on the servers.  This helps me determine whether the sizing guidance from the previously mentioned performance reports needs to be adjusted higher or lower and how many servers to recommend.  The answers also help determine whether some optimization strategies are needed.


As you answer these questions, document them and revisit them periodically to determine if the original assumptions, that led to a given recommended topology and size, have changed and thus necessitate a change in the deployment architecture.  Validate them too with a cohesive monitoring strategy to determine if the environment usage is growing slower/faster than expected or detect if a server is nearing capacity.  Another good best practice is to create a suite of tests to establish a baseline of response times for common day to day scenarios from each primary location.  As you make changes in the environment, e.g. server hardware, memory or cores, software versions, network optimizations, etc., rerun the tests to check the effect of the changes.  How you construct the tests can be as simple as a manual run of a scenario and a tool to monitor and measure network activity (e.g. Firebug).  Alternatively, you can automate the tests using a performance testing tool.  Our performance testing team has begun to capture their practices and strategies in a series of articles starting with Creating a performance simulation for Rational Team Concert using Rational Performance Tester.
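As a minimal sketch of the baseline idea, the helpers below time a scenario several times and flag scenarios that have slowed down relative to a stored baseline. The function names, the 20% tolerance and the use of a plain Python callable to stand in for a scenario are all assumptions for illustration. A real suite would drive the actual application, for example through a performance testing tool.

```python
import statistics
import time
from typing import Callable, Dict


def measure_median_seconds(scenario: Callable[[], None], runs: int = 5) -> float:
    """Run the scenario several times and return the median elapsed time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        scenario()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.20,
) -> Dict[str, float]:
    """Return scenarios whose current time exceeds the baseline by more than
    the tolerance, mapped to the ratio current / baseline."""
    regressions = {}
    for name, base_seconds in baseline.items():
        now_seconds = current.get(name)
        if now_seconds is None or base_seconds <= 0:
            continue  # scenario not rerun, or baseline unusable
        if now_seconds > base_seconds * (1 + tolerance):
            regressions[name] = now_seconds / base_seconds
    return regressions
```

Keeping one baseline per primary location makes it easier to tell a server-side slowdown from a network change that affects only one site.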

In closing, the kind of guidance I’ve talked about often comes out in the context of a larger discussion which looks at the technical deployment architecture from a more holistic perspective, taking into account several of the non-functional requirements for a deployment.  This discussion is typically in the form of a Deployment Workshop and covers many of the deployment best practices captured on the Deployment wiki.  These non-functional requirements can impact your topology and deployment strategy.  Take advantage of the resources on the wiki or engage IBM to conduct one of these workshops.


Recommended practices when integrating Jenkins builds and RTC SCM

Many of our customers are using Jenkins for their software builds integrated with Rational Team Concert (RTC) software configuration management (SCM). This is a powerful combination but one that should be managed carefully so as not to put undue load on the RTC server. Based on several customer experiences, my colleagues from development and services and I have put together an article that captures our guidance and recommendations when using RTC as an SCM system in Jenkins jobs. Check it out here: Using Rational Team Concert for source code management in Jenkins jobs.

Supported options for Reverse Proxies and Load Balancers in an IBM Engineering Lifecycle Management solution deployment

As a deployment architect when advising customers on their topology, I’m often asked about the supported options for reverse proxies and load balancing. Further, some customers ask about using DNS aliasing over a Reverse Proxy, which is supported but not a best practice.

While some of these details can be found in the Knowledge Center and deployment wiki, we’ve not had a central article listing the options with any comparisons. My colleagues in IBM Support have recently published an article with these details here: Reverse Proxies and Load Balancers in CLM Deployment.

Type System Manager Part 2

Ralph Schoon has done some very useful work providing some automation around our best practice guidance for DNG type system management. Check out his blog post to learn more.

rsjazz

We finally published Maintaining the Rational DOORS Next Generation type system in a configuration-management-enabled environment. Part 3: Automation tool deep dive on Jazz.net. This was a major effort and took a long time to do. This article provides a closer look at the source code, what it does and how it does it. It also provides some insight into how OSLC4J works and can be used. The information in the article, especially for setup and deployment of the automation prototype, is very reusable for other scenarios and I hope to be able to reuse it in later articles and blog posts.

Type System Manager

When this effort was planned and performed last year, we had no idea what would come out of this effort. When we finished the first iterations and I started to write Maintaining the Rational DOORS Next Generation type system in a configuration-management-enabled environment. Part 3: Automation tool…


How to register your custom utilities as a resource-intensive scenario

In Resource-intensive scenarios that can degrade CLM application performance I describe how certain IBM Collaborative Lifecycle Management (CLM) application scenarios can be resource-intensive and are known to degrade system performance at times. As I’ve interacted with customers on their deployments and performance concerns, it has become apparent that they are getting more and more creative in building custom automation scripts/utilities using our APIs. At times, these custom utilities have generated significant load on the system.

As a best practice, we now recommend that customers evaluate their custom utilities and determine if any are candidates to be resource-intensive. For those that are, they should be modified and registered as resource-intensive with appropriate start and stop scenario markers included in the code. Until recently, all we could provide to help do this was some code snippets.

Thanks to my colleagues Ralph Schoon, Dinesh Kumar and Shubjit Naik, we now have documented guidance and sample code to help you do this. Have a look at Register Custom Scripts As a Resource Intensive Scenario. Ralph also gives some additional detail behind the motivation for the custom scenario registration in his blog post.

Once your utilities are registered, you will be able to track their occurrence in the appropriate application log. If you’ve implemented enterprise application monitoring, you can also track them through the available JMX MBeans as described in CLM Monitoring.
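One design point about start and stop markers is worth illustrating: the stop marker should be sent even when the utility fails partway through, otherwise the scenario appears to run forever. The sketch below shows that pattern only. The `start_marker` and `stop_marker` callables are hypothetical placeholders for whatever calls the documented guidance and sample code prescribe. They are not the real API, which is not reproduced here.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def resource_intensive_scenario(
    name: str,
    start_marker: Callable[[str], Any],
    stop_marker: Callable[[Any], None],
) -> Iterator[None]:
    """Wrap a block of work between a start and a stop marker.

    start_marker and stop_marker are caller-supplied placeholders
    (assumptions for illustration); start_marker returns a token that
    is handed back to stop_marker.
    """
    token = start_marker(name)
    try:
        yield
    finally:
        # Runs on success and on failure, so the scenario is always closed.
        stop_marker(token)
```

A custom utility would then wrap its heavy section in `with resource_intensive_scenario("nightly export", start, stop):`, where `start` and `stop` are thin wrappers around the registration calls from the sample code.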

Detecting mixed use of RTC SCM clients

The IBM Continuous Engineering (CE) solution supports the notion of N-1 client compatibility. This means a v5.0.x client such as Rational Team Concert (RTC) for Eclipse IDE can connect to a v6.0.x RTC server. Customer deployments may have 100s if not 1000s of users with various clients. It is typically not feasible to upgrade them all at the same time as the server. In some cases, upgrading the client requires a corresponding upgrade in development tooling, e.g. the version of Eclipse, which may not yet be possible for other reasons.

In an environment where there is a mix of RTC clients (inclusive of build clients), a data format conversion on the server is required to ensure that the data from the older clients is made compatible with the format of the newer version managed by the RTC server. Reducing these conversions gives the server less work to do, which matters when multiplied by 100s or 1000s of clients.  Additionally, new capabilities from more recent versions often can’t be fully realized until older clients are upgraded.

There are two ways to determine how much N-1 activity is ongoing in your environment.

ICounterContentService

Enable the RepoDebug service and go to https://<host:port/context>/service/com.ibm.team.repository.service.internal.counters.ICounterContentService. From there, scroll down to the Compatible client logins table.

The sample table above was taken from a 6.0.3 RTC server. It shows that out of 75685 client logins since the server was last started, only 782 of them were from the most recent client. 70% are from 5.x clients. Ideally, these would be upgraded to the latest client.

Compatible client logins MBean

If you are using an enterprise monitoring application integrated with our JMX MBeans, then starting in 6.0.5 the Client Compatible Logins MBean can be used to get details similar to those shown in the table from ICounterContentService. The advantage of the MBean is that the login values can be tracked over time, so you can assess whether efforts to reduce use of old clients are succeeding.

Detecting which users (or builds) are using old clients

In order to determine which users, build servers or other integrations are using older clients, you could parse the access log for your reverse proxy. In particular, find all occurrences of versionCompatibility?clientVersion in the log.

As you can see from the sample access log, there are several instances of calls with a 5.0.2 client. The log entries have the IP address of the end client sourcing the call. If this can be associated to a particular user (or build user), you can then have someone to talk to regarding upgrading their client version.

Detecting RTC SCM access not using Content Caching Proxy

One of our best practices for improved Rational Team Concert (RTC) Software Configuration Management (SCM) response times and reduced load on the RTC server is to use a content caching proxy server. These proxies are placed near users in remote locations where WAN performance is poor (high latency to the RTC server). What is often missed is that we also recommend placing them near build servers, especially those with significant continuous integration volume, to speed up the repeated loading of source content for builds.

This practice is not enforceable. That is, SCM and build client configurations must be manually set up to use the caching proxy. This is particularly troublesome for large remote user populations or where large numbers of build servers exist, especially when they are not centrally managed.

The natural question is how to detect that a caching proxy is not being used when it should be. One way is to look at active services and find service calls beginning with com.ibm.team.scm and com.ibm.team.filesystem for RTC SCM operations or com.ibm.team.build and com.ibm.rational.hudson for RTC build operations.

Since the IP addresses of the available caching proxies are static and known, you can look on the Active Services page for any entries whose Service Name is an SCM or build related service call and whose IP address (Scenario Id) does not belong to a caching proxy. Since the active services entry captures the requesting user ID (Requested By), you can then check with the offending user to understand why the proxy wasn’t used and encourage them to correct their usage.

Active services detail is also available via the Active Services JMX MBean. If an enterprise monitoring application is being used and integrated with our JMX MBeans, then it can be configured to capture this detail, parse it and generate appropriate alerts or lists to identify when a proxy is not being used.
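However the active services detail is captured, the filtering step can be sketched as below. The service name prefixes come from the text above. The dictionary keys `service_name`, `scenario_id` and `requested_by` are hypothetical names standing in for the Service Name, Scenario Id and Requested By columns, and how the entries are exported from the page or the MBean is left out.

```python
from typing import Dict, Iterable, List, Set

# Prefixes named in the text for RTC SCM and RTC build operations.
SCM_AND_BUILD_PREFIXES = (
    "com.ibm.team.scm",
    "com.ibm.team.filesystem",
    "com.ibm.team.build",
    "com.ibm.rational.hudson",
)


def calls_bypassing_proxy(
    entries: Iterable[Dict[str, str]],
    proxy_ips: Set[str],
) -> List[Dict[str, str]]:
    """Return SCM or build entries whose source IP is not a known caching proxy.

    Each entry is assumed to be a dict with the hypothetical keys
    'service_name', 'scenario_id' (the source IP) and 'requested_by'.
    """
    return [
        entry
        for entry in entries
        if entry["service_name"].startswith(SCM_AND_BUILD_PREFIXES)
        and entry["scenario_id"] not in proxy_ips
    ]
```

Grouping the result by `requested_by` gives a short list of users or build accounts to follow up with.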

One other option is to parse the access log for your reverse proxy.  Shown below is sample output from an IBM HTTP Server (IHS) access log.

The access log does not have user ID information but it does have the service calls and the IP address they are coming from.  You would need a way to associate an IP address with a user machine (for those entries not coming from a caching proxy).  Note that if a load balancer is used, the IP address recorded in the access log may not be the true IP address that originated the request.  For this reason, and since the user ID information is not directly available, the Active Services method may be better.