Our Software Code and MongoDB Performance Postmortem

Dale Bingham
11 min read · Nov 21, 2022


At Soteria Software, we just did some massive updates specifically around performance of our OpenRMF Professional application. Now that we are past the scalability and large dataset issues, we decided to do a postmortem just in case this will help other teams get past this. Or help you avoid this entirely from the get-go.

This is a technical postmortem where we go over a few areas we updated in our code, database and processing. Hopefully you and your team can read along for the story and explanation and see if any of this helps you.

Team Meeting for software postmortem and spreading knowledge

TL;DR

Our application uses MongoDB Community in a few database containers with persistent volumes, .NET Core 6 API and messaging components using NATS and the MongoDB C# client, and a web frontend that relies on DataTables to call our APIs and list data. The application did well for small or medium-sized groups of users tracking their RMF and FedRAMP packages, automating around their data, and tracking vulnerability and compliance trends. It worked well handling hundreds of data points on checklists, vulnerability and patch scans, hardware and software listings, and generating compliance from all of that.

The challenge appeared when we received attention from larger groups and agencies that had thousands of devices and tens of thousands of checklists and other data they needed to show. They loved the automation and the time and money saved. However, they had a LOT of data to show, and it got slower the more they added. Some of these packages generated listings of 250,000 or more compliance records. The speed problem showed more and more as we scaled out their data and ingested their infrastructure scans over and over: the more data we loaded, the slower the application delivered it. And that made our great solution very unfriendly to use for these larger groups. That was a problem we had to fix!

To be fair, we knew this was coming in spring 2022 based on initial conversations and testing, and we marked down in our tickets and Trello board what had to be investigated and fixed. We started in early summer 2022 and are set to release our updates shortly, in November 2022. And we had to get to the point of people actually using our solution for this to be a problem at all. So what we did to get here worked well. It just had to be revamped and/or reengineered.

What was happening, why, and what we did to fix it is below. We will continue the performance journey and scalability testing going forward. We are at a point now though to share what we found just in case others can benefit.

Problem and Areas of Focus

We had a few areas in particular where we saw issues as we scaled out the data to very large lists that we were processing and displaying. One area was listing the checklists for an ATO or system package when the count reached 10,000 or more. That got slower and slower the more data we added. The second area was when we uploaded a checklist, patch scan, or other vulnerability scan and we had our POAM live. Adding and updating the POAM entries attached to the uploaded scan data slowed us down immensely.

The third area, which eventually just timed out for us when working with very large lists of data, was our compliance engine. With even 1,000 checklists in the listing, clicking “generate compliance” meant opening up each checklist, each vulnerability in each checklist, and matching each control correlation identifier (CCI) to the required controls/subcontrols. Then tracking the overall status of those per control and then per family. And saving all these records and controls/subcontrols with a status per checklist vulnerability listing. That process is correct. The implementation was not. Trying to do all of that synchronously in one API call, with the caller waiting on the results, kept timing out once the number of checklists got large.

Those were the areas we had to fix. There was some overlap across the issues as we investigated more and more, all pointing back to the size of the data and how we were managing and processing it on the frontend and backend. So we focused on tracking larger lists, how we could make the add/update faster, and how we could use the UI to call the data with proper pagination (as we were not using that correctly yet!). The results of how we fixed these issues and more are spelled out below. As usual, this is one way to do it and it works. Your results and implementations may be different. There are always n+1 solutions to any design situation!

Fixing the Database Backend

On the database side, we only had to do two things. And these actions were fairly easy once we figured them out. Getting to the resolution on what to do took a bit of time to study and test.

The first was to make a MongoDB collection of combined data kept up to date in an eventually consistent way. This is more of a CQRS-type setup, where we query this combination of data built in a separate collection. The add/update/delete of the data happens in the primary collections and then triggers an update of the new consolidated read-only listing we query. That made our main listing of checklists with their overall score (open vulnerability counts) massively faster, even with searching and filtering the data. Funnily enough, it also made three other areas faster: test plan summary, reporting, and exporting data to MS Excel. That was an exciting bonus!
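To make that concrete, here is a rough sketch (not our exact code) of the pattern: a NATS subscriber listens for checklist save events and upserts a pre-combined summary record into the read-only collection the listing pages query. The subject name, collection name, and fields are hypothetical placeholders.

using System;
using System.Text;
using MongoDB.Bson;
using MongoDB.Driver;
using NATS.Client;

public class ChecklistSummaryUpdater
{
    private readonly IMongoCollection<BsonDocument> _summaries;

    public ChecklistSummaryUpdater(IMongoDatabase db)
    {
        // the consolidated, read-only listing collection the UI queries
        _summaries = db.GetCollection<BsonDocument>("checklistSummaries");
    }

    public void Subscribe(IConnection nats)
    {
        // the write side publishes this subject whenever a checklist is added or updated
        nats.SubscribeAsync("openrmf.checklist.save", async (sender, args) =>
        {
            var checklistId = Encoding.UTF8.GetString(args.Message.Data);

            // build the combined summary record (title, host, open/closed counts, etc.)
            var summary = new BsonDocument
            {
                { "checklistId", checklistId },
                { "updatedOn", DateTime.UtcNow }
                // ...plus the score fields the listing page needs
            };

            // upsert so the read model catches up with the write side (eventual consistency)
            await _summaries.ReplaceOneAsync(
                Builders<BsonDocument>.Filter.Eq("checklistId", checklistId),
                summary,
                new ReplaceOptions { IsUpsert = true });
        });
    }
}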

The second was investigating the indexes used for the queries and filters. We had several collections with a few indexes in the right spots. However, we had a few database calls that filtered on equals, greater-than, and less-than comparisons with no index behind them to actually make those comparisons quick. So we updated a few indexes and tested along the way to ensure we had faster queries happening on the backend.
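As an illustration, something along these lines with the MongoDB C# driver covers the kind of indexes we are describing; the collection and field names are examples, not our exact schema.

using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class IndexSetupSketch
{
    public static Task CreateAsync(IMongoDatabase db)
    {
        var checklists = db.GetCollection<BsonDocument>("checklists");

        return checklists.Indexes.CreateManyAsync(new[]
        {
            // equality filter: which system package a checklist belongs to
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("systemGroupId")),

            // compound index backing "equals on package, range on last updated"
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys
                    .Ascending("systemGroupId")
                    .Descending("updatedOn"))
        });
    }
}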

Of course, too many indexes and THAT can slow you down as well. It is a balance that we continue to monitor and test.

Fixing the API calls and Business Code

On the API side and in the business logic processing we had several things to fix. They depended on the database and definitely impacted the UI as well as our external API.

First things first: we had a few places where we were doing individual Add and Update calls repeatedly to the database in sequence when we had a list of data being processed! The Delete calls were done in bulk already…which made us wonder why we did not do the Add and Update in the same manner. So we fixed that with the “InsertManyAsync” and “UpdateManyAsync” types of calls, passing them a list of records. That sped up the ingest of data, the generating of the POAM, and the generating of compliance immediately. So much so that we did not believe it was working and did many manual database checks using Mongo Express to ensure the data was actually being added and updated.
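A simplified before/after sketch, assuming a placeholder VulnRecord type: for new records InsertManyAsync sends the whole list in one round trip, and for a mixed batch of per-record updates, BulkWriteAsync with ReplaceOneModel is one common way to do the same.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Driver;

// placeholder record type for illustration only
public class VulnRecord
{
    public string Id { get; set; }
    public string Status { get; set; }
}

public static class BulkSaveSketch
{
    public static async Task SaveAsync(
        IMongoCollection<VulnRecord> collection,
        List<VulnRecord> newRecords,
        List<VulnRecord> changedRecords)
    {
        // one round trip for all new records, instead of one InsertOneAsync per record
        if (newRecords.Count > 0)
            await collection.InsertManyAsync(newRecords);

        // one bulk write for the whole batch of per-record replacements
        if (changedRecords.Count > 0)
        {
            var updates = changedRecords.Select(r =>
                new ReplaceOneModel<VulnRecord>(
                    Builders<VulnRecord>.Filter.Eq(x => x.Id, r.Id), r)
                { IsUpsert = true });

            await collection.BulkWriteAsync(updates);
        }
    }
}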

We also had several areas where we could accumulate very, very large lists we needed to query and display back to the users. And we were sending back the whole list to show in our DataTables on certain pages. For the fix here, we implemented pagination for those calls at the database level. Of course, that meant the UI had to change to call them correctly (see our UI section fixes below) and our external API had to be updated with page and limit types of references. That way we only send back the data requested via our API as well.

At the same time, we had to make the type of call shown below at the database level, rather than in our business logic, to make sure we were only pulling back the data we needed to show. So the fix here was a business logic, database call, and UI fix all in unison.

query.Skip((page)*pagesize).Take(pagesize).ToListAsync()

In addition to pagination, some of our calls took two separate database tables/collections and combined the data on the fly at runtime. And I mean THE WHOLE LISTINGS. And then filtered on them…which is a very bad use of resources. Here we added a new combined collection that is filled with eventual consistency, added to and updated based on certain events and commands published over our NATS backend messaging path. This is the CQRS approach mentioned above in the database section.

And we passed the filtering down to the database to let the database do its job well. Pagination, combined data, and a better UI call together made a massive difference in performance and in how we handled very large datasets.
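Putting those pieces together, a paged and filtered read against the combined collection looks roughly like this; again, the collection and field names are just examples, not our exact schema.

using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class PagedQuerySketch
{
    public static async Task<(long Total, List<BsonDocument> Rows)> GetPageAsync(
        IMongoCollection<BsonDocument> summaries,
        string systemGroupId, int page, int pageSize)
    {
        // the filter runs in MongoDB, not in our business code
        var filter = Builders<BsonDocument>.Filter.Eq("systemGroupId", systemGroupId);

        // total count for the pager, also computed by the database
        var total = await summaries.CountDocumentsAsync(filter);

        // only the requested page comes back over the wire
        var rows = await summaries.Find(filter)
            .Sort(Builders<BsonDocument>.Sort.Ascending("title"))
            .Skip(page * pageSize)
            .Limit(pageSize)
            .ToListAsync();

        return (total, rows);
    }
}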

Finally, we had to deal with all the pulling back of data, filtering, and processing in memory causing timeouts or out-of-memory (OOM) errors when generating compliance across 10,000+ checklists. The process we were automating was correct. The implementation was not, as it would time out on large listings. So we did a few things here (sketched in code after the list below) with the knowledge gained from the pagination and bulk add/update code fixes in other places:

  • made the compliance generation asynchronous; we now perform it behind the scenes with a publish/subscribe event
  • processed checklists 50 at a time in our backend code, generating compliance in smaller chunks and saving in bulk here as well
  • generated a notification message when done to show we completed the processing
  • generated the “score” of the compliance at the control/subcontrol level to make display faster and easier
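A rough sketch of the shape of that flow is below. It is not our exact code: the NATS subject and the per-checklist generateOne delegate are hypothetical placeholders standing in for the real CCI-to-control matching logic.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;
using NATS.Client;

public static class ComplianceGenerationSketch
{
    private const int ChunkSize = 50; // checklists handled per pass

    public static async Task GenerateAsync(
        IConnection nats,
        IMongoCollection<BsonDocument> complianceResults,
        IReadOnlyList<string> checklistIds,
        Func<string, Task<List<BsonDocument>>> generateOne) // per-checklist CCI-to-control matching
    {
        var started = DateTime.UtcNow;

        foreach (var chunk in checklistIds.Chunk(ChunkSize))
        {
            var records = new List<BsonDocument>();
            foreach (var checklistId in chunk)
                records.AddRange(await generateOne(checklistId));

            // bulk save each chunk rather than one insert per record
            if (records.Count > 0)
                await complianceResults.InsertManyAsync(records);
        }

        // publish a completion event so the UI can show a notification
        // and the start/end processing time
        var payload = $"{started:o}|{DateTime.UtcNow:o}";
        nats.Publish("openrmf.compliance.complete", Encoding.UTF8.GetBytes(payload));
    }
}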

Making all of this asynchronous not only made the UI faster at displaying data and tracking it correctly. It let us display data even in the middle of generating the compliance. It let us track the start and end times to show the processing time. And a final bonus here: it let us add an external API call to generate compliance, not just a button in the UI.
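The external API side of that can be as small as an endpoint that publishes the event and returns immediately; the route and subject name in this sketch are hypothetical.

using System.Text;
using Microsoft.AspNetCore.Mvc;
using NATS.Client;

[ApiController]
[Route("api/compliance")]
public class ComplianceController : ControllerBase
{
    private readonly IConnection _nats;
    public ComplianceController(IConnection nats) => _nats = nats;

    [HttpPost("generate/{systemGroupId}")]
    public IActionResult Generate(string systemGroupId)
    {
        // hand the work to the backend worker over the message bus and return right away
        _nats.Publish("openrmf.compliance.generate", Encoding.UTF8.GetBytes(systemGroupId));

        // 202 Accepted: generation runs asynchronously and a notification follows when done
        return Accepted();
    }
}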

Fixing the Frontend UI API Calls

On the UI end of things, for some of our listings we started to pull back 100,000+ records and noticed a few things:

  • slow response on loading that got worse and worse over time with more data
  • if there was going to be a LOT of data, slow response times sending that large payload back to the browser to display, even with GZIP enabled on our NGINX reverse proxy
  • Chrome/Firefox/Edge would crash or just not show data once the response hit 500 MB of JSON or more

So the question of WHY we were sending all that data came up. The answer was so the DataTable could pull it all in and sort, page, and filter the data for the users on the fly. The challenge came from how we were using DataTables and calling the data. We were using AJAX in most places and using the data straight, which is a good way to do this. However, on some pages we were going row by row in JavaScript and adding the data to the DataTable with a “table.row.add(xxxx)” type of call. That is very bad for large datasets.

The first thing we fixed was the “one row built and added at a time” approach: we put the display logic into the “render”: area of the DataTable configuration. That sped up rendering quite a bit for sure. For very large lists, the problem was now the amount of data being sent back and processed by the UI. So fixing JUST the “one row added at a time” issue did not help 100%. So we turned to the DataTables server-side processing examples and method to make large lists work well. That was the direction that worked best.

Of course, that meant the AJAX call from our UI had to send all the data our API needed. The DataTables code then adds on the draw, start, length, columns, and other fields it needs to send and receive for proper pagination and rendering. We needed to read all of that on our backend API components and business code (mentioned above), and then generate the data going back to the DataTable. This changed the data model being sent back, as it must have all the required references as well as the list of data in the “data”: [ ] area of the JSON returned (see the DataTables server-side processing documentation). And we generate only the data currently being displayed and required, as well as the count of the total records and the count of the filtered records, if any filtering is happening.
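Here is a sketch of what that contract can look like on the .NET side. The route, record type, and repository interface are hypothetical; the draw/start/length/search parameters and the draw, recordsTotal, recordsFiltered, and data fields in the response are what DataTables server-side processing sends and expects (ASP.NET Core’s default camelCase JSON serialization produces those lowercase keys).

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// placeholder row shape for illustration only
public record ChecklistRow(string Id, string Title, int OpenCount);

// hypothetical data-access interface that pushes paging/filtering down to MongoDB
public interface IChecklistReadRepo
{
    Task<(long Total, long Filtered, List<ChecklistRow> Rows)> GetPageAsync(
        string search, int skip, int take);
}

public class DataTablesResponse<T>
{
    public int Draw { get; set; }             // echoed back from the request
    public long RecordsTotal { get; set; }    // count before filtering
    public long RecordsFiltered { get; set; } // count after filtering
    public List<T> Data { get; set; }         // only the rows for this page
}

[ApiController]
[Route("api/checklists")]
public class ChecklistListController : ControllerBase
{
    private readonly IChecklistReadRepo _repo;
    public ChecklistListController(IChecklistReadRepo repo) => _repo = repo;

    [HttpGet("list")]
    public async Task<IActionResult> List(
        [FromQuery] int draw,
        [FromQuery] int start,                                  // offset of the first row
        [FromQuery] int length,                                 // page size
        [FromQuery(Name = "search[value]")] string search = "")
    {
        var (total, filtered, rows) = await _repo.GetPageAsync(search, start, length);

        return Ok(new DataTablesResponse<ChecklistRow>
        {
            Draw = draw,
            RecordsTotal = total,
            RecordsFiltered = filtered,
            Data = rows
        });
    }
}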

This UI update was done in several places that have the potential for very large datasets right now (checklist listing, POAM, reports). And the results were spectacular for sure! Over time as we track usage and performance, we will probably update a few other areas with the same methods and model of data display and searching where required. Because it proved to work well.

End Result and Going Forward

The combination of all this frontend work and the API fixes as well as pagination calls done inside the database made our application scream for small data, medium data (15,000 items in a listing) as well as very large lists of data (several hundred thousand or more items)! It was not one thing, it was all of the ideas and fixes mentioned together in combination.

In some places a 4-minute load of data was reduced to 2 seconds or less, because we searched with indexes on a combined collection of data, with pagination, and only sent back what we needed. And this opens up the use of our application to small, medium, large, or enterprise teams that want to automate their cyber compliance data and processes.

Making the compliance generation a backend asynchronous process, mixed with pagination and processing chunks of data at a time with a bulk add database call, made that process actually work correctly with larger lists. Before these changes it was not even able to process the data once we got to the 1,000-checklist range.

This compliance generation is a big feature for our OpenRMF Professional application. It is a VERY manual process when left for people to do themselves from the raw data and checklists, and to track at that level of complexity and detail. What can take teams of multiple people weeks to do, we reduce to minutes with the same (or better) results. And with our solution, these teams can do it over and over regardless of team size.

Getting this correct was imperative, and all the work done on the other database, API component, and frontend fixes aided us in getting this redesigned to work correctly. The massive amount of data generated can now display correctly, using the UI server-side processing and pagination updates as well. Combined with the work we did to generate and record the score of compliance, this area of our application now loads in seconds compared to minutes or not-at-all. Another great result from our team!

All in all, this work made us revisit key areas of the frontend, backend, and database design and ensure we can handle small, medium, large, and massive amounts of data the same way, consistently for our users, without changing our whole technology stack. And having gone through this process, we can now talk about it with confidence.

We also now have a method and process to fix these types of issues in other areas if we see them pop up. And we will design new features and other components that work with OpenRMF Professional with this in mind up front, so we do it well from the start.

We did not do this alone

Of course, we did not do this alone. Good, strong tech people combining brainpower on a problem like this works best. A special thanks here to our value-added reseller, Team OMNI, and their tech staff for helping us pinpoint the issues, track the response-time areas needing work, and brainstorm possible solutions. They have a good group of smart people we were able to work with to track this issue and talk through ideas and ways forward to get a good result for all customers, including theirs!


Written by Dale Bingham

CEO of Soteria Software. Developer on OpenRMF. Software Geek by trade. Father of three daughters. Husband. Love new tech where it fits. Follow at @soteriasoft
