Rally On-Premise Edition: What is My User Capacity? White Paper
By Jim Campbell, Rally On-Premise Edition Product Owner

On-Premise will support your capacity needs. Rally has had more than enough capacity for our largest Fortune 100 customers. But it is important to evaluate your organization’s specific dynamics to decide how many users you can comfortably support in your hardware environment.

Rally offers an On-Premise version of our Application Lifecycle Management (ALM) product. It is deployed as a copy of our SaaS stack, packaged in a VMware appliance. This virtual machine contains the same components as our operational stack: Oracle, Apache, and our application (app) servers. It is basically plug and play: install it in a VMware environment, bring it up, start adding users, and get to work. Currently, Rally has customers with more than 10,000 individual users, and many customers with well over 1,000 users.

We are often asked how many users the On-Premise appliance can support. Every customer has a different load profile on Rally, and one of the benefits of a SaaS deployment is that we can look at the differences in how customer usage impacts the Rally stack. In this white paper we’ll present some data on those variations to help you answer “how many users can I put on my Rally On-Premise instance?” because the answer really is “it depends.”

www.rallydev.com ©2013 Rally Software Development

As the only Agile multi-tenant solution with thousands of customers worldwide, our SaaS stack has a rich set of logging and analysis tools that allow our Operations (Ops) team to monitor customer activity without accessing proprietary content. Our On-Premise Edition will deploy this same monitoring capability later this year, which will allow Rally administrators to see data such as transactions per second (TPS) and Java virtual machine (JVM) memory usage.

Looking at Actual Customer Data

So how do we start to answer the question of capacity for your Rally On-Premise Edition?
Since we don’t have access to actual On-Premise data, we started by looking at the load characteristics of a number of our largest SaaS customers, whose organizational dynamics are similar to those of our large On-Premise customers. We also focused on TPS as the limiting factor for capacity, because it is a metric that we can easily measure on a per-customer basis. Memory is another limiting factor, but current memory utilization is visible in the On-Premise control panel, and we’ve had good success with customers simply increasing the memory on the VM once peak utilization approaches 90%. For our largest customers (many thousands of users), 12 GB has been more than adequate.

Analyzing the numbers from our large SaaS customers

First of all, there are two types of users. There are human users who interact by clicking on a screen, and there are “robot” users, which are connectors to other systems, customer-written data extractors, and so on. Different customers have different mixes of these two types of users. Some have almost no robot transactions; others have a large number of automated transactions.

Here is the hourly maximum TPS over two days for a large customer based in the eastern US with several thousand users.

Some observations:
• The human transactions pretty much follow a typical business day. People use Rally more in the morning when the coffee is kicking in, take lunch off, and have another jump mid-afternoon.
• When we drill down into the makeup of the customer’s robot transactions, we see that they are mostly driven by the Eclipse and Excel plugins, as well as the Subversion Source Code Management connector. Since these are “people-driven,” that explains the business day cycle. The customer also uses our Ruby API for custom integrations, which accounts for the remaining robot transactions.
• Their peak load is 6.5 TPS.

Here is a much larger customer based on the US west coast, with a different set of transaction dynamics.
Some observations on this second customer:
• We see a substantially higher ratio of robot TPS to human TPS. Unlike the first customer’s, the robot transactions do not follow the shape of the human transactions. The data shows that this customer’s use of connectors is much more oriented to custom integrations using the Rally API, and is therefore probably less “people-driven.” The two matching spikes at 4am on both days hint strongly at off-hour custom integrations for data extraction.
• We see a peak of 11.5 TPS at 8am US Western time, when there is a high load (4 TPS) of robot transactions.
• When we did further research, we found that this customer has more users distributed across other countries than the first customer does. The spikes at 2am suggest some correlation between human TPS and time zones for this customer.

Looking across a number of other large customers, we found that peak TPS was lower for customers with users in more time zones (i.e., load was spread over the 24-hour clock).

Normalized TPS

To allow an “apples to apples” comparison, we need to normalize the data against subscription size. Below are the maximum TPS load graphs for the two customers above, normalized to TPS per 1000 users.

Customer 1 average: 1.3 (18% higher)
Customer 2 average: 1.1

There is a small (18%) difference in average TPS over the course of 24 hours, but the first customer’s max TPS is 61% higher because of the much greater gap between its peak and average load.

Below is a representative sample of large customers and their max TPS during each hour. Each line is a separate customer, with a circle for each hourly TPS sample.

Looking at this set of customers, we see a clear concentration between 2 and 4 TPS/1000 users. But note that some customers have an equal load from robot transactions.
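The normalization used in this section can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Subscription` class, the function names, the user counts, and the hourly series are all hypothetical placeholders, since the paper publishes neither subscription sizes nor the underlying samples.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Subscription:
    """Hourly max-TPS samples for one subscription (all values hypothetical)."""
    name: str
    users: int
    hourly_max_tps: list[float]


def per_thousand(tps: float, users: int) -> float:
    """Scale a raw TPS value to TPS per 1000 users."""
    if users <= 0:
        raise ValueError("users must be positive")
    return tps * 1000.0 / users


def summarize(sub: Subscription) -> dict[str, float]:
    """Return the normalized average, normalized peak, and peak-to-average ratio."""
    normalized = [per_thousand(t, sub.users) for t in sub.hourly_max_tps]
    avg = mean(normalized)
    peak = max(normalized)
    ratio = peak / avg if avg > 0 else 0.0
    return {"avg": avg, "peak": peak, "peak_to_avg": ratio}


if __name__ == "__main__":
    # Placeholder data only: not taken from any real customer.
    examples = [
        Subscription("example-a", 4000, [1.0, 3.0, 6.5, 4.0, 5.5, 2.0]),
        Subscription("example-b", 10000, [4.0, 6.0, 11.5, 9.0, 10.0, 7.5]),
    ]
    for sub in examples:
        stats = summarize(sub)
        print(
            f"{sub.name}: avg {stats['avg']:.2f}, "
            f"peak {stats['peak']:.2f} TPS per 1000 users, "
            f"peak/avg {stats['peak_to_avg']:.2f}"
        )
```

Comparing the peak-to-average ratio alongside the normalized average is what separates a "spiky" subscription from a flat one, even when their averages are close.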
If those robot transactions were running at the same time as the peak human use for Customer 1, we could see a load as high as 7 TPS/1000 users.

There is an important point to be made here: it is entirely possible to max out the capacity of a Rally instance by unleashing a set of poorly written robot connectors. Our Ops team monitors our SaaS stack for this and actively reaches out to customers running high robot TPS to help them fine-tune their integrations and lower their peak loads. As an On-Premise customer, you’ll want to work with your users on building and using well-behaved integrations.

We can confidently say that our large SaaS customers average 2-4 TPS per 1000 users, and we believe that you will see similar results on the Rally On-Premise appliance. This number can be affected by integrations and by the time zone distribution of users.

On-Premise Capacity

Now that we can estimate the max TPS your users will generate, what is the maximum TPS capacity of an On-Premise server? Again, “it depends.” This time it depends on the hardware you are running it on. Our On-Premise Edition is shipped as a VMware appliance, to be installed in a VM infrastructure. We recommend a configuration (listed on our web page) of two CPU cores and 8-12 GB of memory.

For our SaaS stack, we have a sophisticated load test that we run against every SaaS version before we release it. Using that same load test against the most recent On-Premise release, on the recommended hardware configuration, with a typical user data set, we measure a maximum of 17 TPS before we max out the CPUs.

Our experience with our SaaS stack tells us that increasing CPU cores should improve capacity in a near-linear fashion, and we are seeing the same behavior with the On-Premise Edition. Doubling the CPU cores to 4 moves the maximum TPS up to 40. That is as high as we have currently tested, since it is more than enough for our largest customers and prospects to date.
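Taken together, the measured capacity figures (17 TPS on 2 cores, 40 TPS on 4 cores) and the 2-4 TPS per 1000 users range give a rough way to estimate a user ceiling. The sketch below simply divides capacity by normalized load. The function name and the `headroom` parameter are assumptions made for illustration: the paper does not say how its guideline figures in the Maximum Users section were derived, and those figures are somewhat lower than the raw quotient (for example 7,500 rather than 8,500), which would be consistent with some safety margin.

```python
import math

# Maximum TPS measured in the paper's load test, keyed by CPU core count.
MEASURED_MAX_TPS = {2: 17.0, 4: 40.0}


def max_users(max_tps: float, tps_per_1000: float, headroom: float = 0.0) -> int:
    """Estimate how many users a server can carry.

    max_tps       -- measured TPS ceiling of the server
    tps_per_1000  -- expected peak load, in TPS per 1000 users
    headroom      -- fraction of capacity held in reserve (an assumption,
                     not a figure from the paper)
    """
    if tps_per_1000 <= 0:
        raise ValueError("tps_per_1000 must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_tps = max_tps * (1.0 - headroom)
    return math.floor(usable_tps / tps_per_1000 * 1000)


if __name__ == "__main__":
    for load in (2.0, 3.0, 4.0):
        estimates = {
            cores: max_users(tps, load)
            for cores, tps in sorted(MEASURED_MAX_TPS.items())
        }
        print(f"{load:.0f} TPS/1000 users -> {estimates}")
```

Treat the result as an upper bound under the stated assumptions. The actual ceiling depends on the mix of human and robot transactions and on when the robot load runs.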
We see near-linear improvements as we add JVMs to our SaaS stack, so we have every expectation that we’ll see the same in our On-Premise Edition when we need to handle larger subscriptions.

Maximum Users

So what does this mean for the question of how many users an On-Premise instance can support? Again, it depends on the dynamics of the specific customer load, but here are some guidelines:

Normalized Transaction Load    Max Users, Standard Configuration    Max Users, Large Configuration
                               (2 CPU Cores, 8 GB)                  (4 CPU Cores, 16 GB)
2 TPS/Thousand                 7,500                                17,000
3 TPS/Thousand                 5,000                                12,000
4 TPS/Thousand                 3,700                                8,500

We’ve seen that there is about a 2x variability in maximum load on Rally across a set of large customers. A new On-Premise installation won’t initially be able to judge what its normalized TPS will be, and therefore what its maximum number of users is. But based on our experience, we know that it won’t be at the high end (4 TPS/1000 users) in the initial period of use.

We’re planning on releasing monitoring tools for Rally On-Premise in the future, and then customers will be able to watch these transaction rates and upgrade their hardware as needed. It is always appropriate to work with the authors of robot transaction integrations to either reduce their peak load or move the traffic to low-usage hours. We are confident that by adding more cores and memory, and by looking at other hardware options such as SSDs, we have lots of runway for user configurations far larger than the numbers above.

If you have any other questions, please don’t hesitate to contact your sales rep or your technical account manager. Enjoy your Rally On-Premise instance!

Acknowledgements

A number of people provided valuable input into this paper. Thanks to Marc Chipouras, Brian Dupras, Vikas Shivamurthy, Dara Warde and Ian Whitmore.