Optimizing Record Storage and Search in Modern Applications

Apr 20, 2022 | Dev Team

At Great Ideaz, we talk a lot about Create, Read, Update and Delete (CRUD) processing and how the trellispark framework eliminates the need for CRUD code, removing 90-95% of the code-behind from our web forms.

What we don’t talk about as much is the Command and Query Responsibility Segregation (CQRS) model and why it makes our data-agnostic framework approach so effective. CQRS draws a line between how data is created and how it is used.

Let’s think about the data that our applications create. For the most part, it is Write Once, Read Many (WORM). Consider some examples:

  • A customer record is created when we gain a new customer and may be updated a few times over its lifetime, say when the customer moves to a new address.
  • An order record is created when the customer places an order and is then updated a few times as we process the order to completion.
  • A product record is created when we have a new product to sell; it might be updated once a year as new versions of the product become available.
  • A price list is created at the start of each sales period and never modified.

These examples tell us a lot about how our applications create and maintain data. For the most part, records are created and then only rarely modified. They have long lifetimes during which they may be accessed many times, for example:

  • Customer and Order records are accessed whenever we need a list of customers, or when we are finding/placing/fulfilling an order. They are typically searched/listed/read thousands of times for every update.
  • Product and Price records can be read millions of times for every update.

What this tells us is that we should spend a little more time when we create or update data to make sure we can then search and access that data more efficiently.

In trellispark we store data agnostically as XML data fragments, taking a document-database approach inside a fully relational database. This means that we can add all sorts of extra functionality when we create data (sketched below), such as:

  • Row level access control
  • Row level version history
  • Row level data audit
  • Tagging rows with:
    • Milestones
    • Events
    • Alerts

This enables us to go well beyond the traditional CRUD capability with our forms builder without any significant loss of performance.
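As a rough sketch of what that can look like in storage, a record table in this style might resemble the following. The table and column names here are hypothetical illustrations, not the actual trellispark schema:

    -- Hypothetical record store: one row per record, with the data held as an XML fragment
    -- and the row-level metadata (security, versioning, audit, tagging) stored alongside it.
    CREATE TABLE dbo.RecordStore
    (
        RecordGUID     UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,   -- record identity
        RecordVersion  INT              NOT NULL,               -- current version (history kept elsewhere)
        RecordData     XML              NOT NULL,               -- the data-agnostic XML fragment
        OwnerGUID      UNIQUEIDENTIFIER NOT NULL,               -- row-level access control
        CreatedBy      NVARCHAR(128)    NOT NULL,               -- row-level data audit
        CreatedOn      DATETIME2        NOT NULL DEFAULT SYSUTCDATETIME(),
        Milestone      NVARCHAR(50)     NULL,                   -- tagging: milestone
        EventTag       NVARCHAR(50)     NULL,                   -- tagging: event
        AlertTag       NVARCHAR(50)     NULL                    -- tagging: alert
    );

Because the fragment itself is opaque to the relational engine, everything needed for access control, auditing and tagging lives in ordinary indexed columns next to it.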

As the data is persisted to storage, we then take a fully optimized approach to how the data will be searched and retrieved.

We start by shredding the XML data fragment into a set of read-only lookup tables at the record level. These lookup tables need not contain all of the fields in the record; we only take the fields that will be searched/filtered or used to create relational joins to other data. This means that the lookup tables contain only columns that will be indexed for optimal query performance. If we need the full record, we first find the required record keys and then pull the full XML data fragments as required. This gives us the full performance of a relational database when accessing unstructured XML fragments. In practice, it means we can pull the data for lists at least as fast as a fully optimized relational database application and still get all of the benefits of a document database.
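To make the shred-and-lookup pattern concrete, here is a minimal sketch using hypothetical table, column and XML element names, reusing the hypothetical dbo.RecordStore table from the earlier sketch (the real trellispark implementation will differ):

    -- Hypothetical lookup table: only the columns we search, filter or join on.
    -- The full record stays in the XML fragment in dbo.RecordStore.
    CREATE TABLE dbo.OrderLookup
    (
        RecordGUID    UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
        CustomerGUID  UNIQUEIDENTIFIER NOT NULL,
        OrderNumber   NVARCHAR(20)     NOT NULL,
        OrderDate     DATE             NOT NULL,
        OrderStatus   NVARCHAR(20)     NOT NULL
    );
    CREATE INDEX IX_OrderLookup_Customer ON dbo.OrderLookup (CustomerGUID, OrderDate);

    -- On save, shred the searchable fields out of the XML fragment into the lookup table.
    -- (@RecordGUID would be a parameter of the save procedure.)
    INSERT INTO dbo.OrderLookup (RecordGUID, CustomerGUID, OrderNumber, OrderDate, OrderStatus)
    SELECT  r.RecordGUID,
            r.RecordData.value('(/Order/CustomerGUID)[1]', 'UNIQUEIDENTIFIER'),
            r.RecordData.value('(/Order/OrderNumber)[1]',  'NVARCHAR(20)'),
            r.RecordData.value('(/Order/OrderDate)[1]',    'DATE'),
            r.RecordData.value('(/Order/Status)[1]',       'NVARCHAR(20)')
    FROM    dbo.RecordStore AS r
    WHERE   r.RecordGUID = @RecordGUID;

    -- On read, find the keys through the indexed lookup, then pull full fragments only as needed.
    SELECT  r.RecordData
    FROM    dbo.OrderLookup AS l
    JOIN    dbo.RecordStore AS r ON r.RecordGUID = l.RecordGUID
    WHERE   l.CustomerGUID = @CustomerGUID
      AND   l.OrderStatus  = N'Open';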

We don’t stop there! Consider the TSQL inside most applications. We might need a list of orders for a customer, so we might be joining a Customer table with an Order table and an Order Line table. Perhaps we also need to get some Product data. We typically see lists generated by TSQL that joins several tables; I have seen extreme cases of reporting TSQL where 15-20 tables were routinely joined.
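The kind of list query we are describing looks roughly like this, using an illustrative schema with @CustomerID standing in for a procedure parameter:

    -- Illustrative multi-table list query of the kind that accumulates in application TSQL:
    -- orders for one customer, joined out to order lines and products.
    SELECT  c.CustomerName,
            o.OrderNumber,
            o.OrderDate,
            ol.Quantity,
            p.ProductName,
            p.UnitPrice
    FROM    dbo.Customer  AS c
    JOIN    dbo.[Order]   AS o  ON o.CustomerID = c.CustomerID
    JOIN    dbo.OrderLine AS ol ON ol.OrderID   = o.OrderID
    JOIN    dbo.Product   AS p  ON p.ProductID  = ol.ProductID
    WHERE   c.CustomerID = @CustomerID;

Every extra table in a join like this is more work the optimizer has to do on every single read.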

One approach to flattening out these multi-table joins is to create a lookup table that captures the results of the join. These lookup tables can be created and maintained on the fly using triggers as new records are created, or refreshed on known cycles. Refresh cycles can be hourly, daily, or weekly depending on the rate of data change. For things that change infrequently, such as product catalogs or price lists, the refresh could occur when changes are published. In practice, using triggers to maintain lookup tables from real-time record updates may add a few milliseconds to each save, while a multi-million-row scheduled refresh may take seconds or minutes to rebuild the lookup table. The appropriate choice obviously depends on the data refresh rate and latency requirements.
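Here is a sketch of both maintenance options, assuming the illustrative Customer/Order/OrderLine/Product schema from the previous example:

    -- Hypothetical flattened lookup capturing the Customer/Order/OrderLine/Product join.
    CREATE TABLE dbo.CustomerOrderLineLookup
    (
        OrderLineID   INT            NOT NULL PRIMARY KEY,
        CustomerID    INT            NOT NULL,
        CustomerName  NVARCHAR(100)  NOT NULL,
        OrderNumber   NVARCHAR(20)   NOT NULL,
        OrderDate     DATE           NOT NULL,
        ProductName   NVARCHAR(100)  NOT NULL,
        Quantity      INT            NOT NULL
    );
    GO

    -- Option 1: maintain the lookup on the fly with a trigger (adds a few ms to each save).
    CREATE TRIGGER trg_OrderLine_FlattenLookup ON dbo.OrderLine
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.CustomerOrderLineLookup
               (OrderLineID, CustomerID, CustomerName, OrderNumber, OrderDate, ProductName, Quantity)
        SELECT  i.OrderLineID, c.CustomerID, c.CustomerName, o.OrderNumber, o.OrderDate, p.ProductName, i.Quantity
        FROM    inserted      AS i
        JOIN    dbo.[Order]   AS o ON o.OrderID    = i.OrderID
        JOIN    dbo.Customer  AS c ON c.CustomerID = o.CustomerID
        JOIN    dbo.Product   AS p ON p.ProductID  = i.ProductID;
    END;
    GO

    -- Option 2: rebuild on a schedule (e.g. nightly) for data that changes infrequently.
    TRUNCATE TABLE dbo.CustomerOrderLineLookup;
    INSERT INTO dbo.CustomerOrderLineLookup
           (OrderLineID, CustomerID, CustomerName, OrderNumber, OrderDate, ProductName, Quantity)
    SELECT  ol.OrderLineID, c.CustomerID, c.CustomerName, o.OrderNumber, o.OrderDate, p.ProductName, ol.Quantity
    FROM    dbo.OrderLine AS ol
    JOIN    dbo.[Order]   AS o ON o.OrderID    = ol.OrderID
    JOIN    dbo.Customer  AS c ON c.CustomerID = o.CustomerID
    JOIN    dbo.Product   AS p ON p.ProductID  = ol.ProductID;

    -- Either way, the list read becomes a single-table, index-backed query:
    SELECT  CustomerName, OrderNumber, OrderDate, ProductName, Quantity
    FROM    dbo.CustomerOrderLineLookup
    WHERE   CustomerID = @CustomerID;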

The cost of creating these flattening lookup tables is relatively minor because data changes are rare compared to search/filter/read operations. The small cost added to each save is offset across thousands of reads: a query that might take a few seconds against multiple tables is reduced to milliseconds against a flat lookup table.

The other thing to consider is layering these flat lookup tables into hierarchies to optimize creation and search times. I have seen these flat lookups structured three or four layers deep to get the best balance between create and read performance.
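For example, a second-layer lookup can be built from the first-layer flattened lookup rather than from the base tables (again hypothetical names, following the sketches above):

    -- Second-layer lookup: one row per customer, summarizing the first-layer
    -- dbo.CustomerOrderLineLookup. Refreshed after the layer below it.
    CREATE TABLE dbo.CustomerOrderSummaryLookup
    (
        CustomerID     INT            NOT NULL PRIMARY KEY,
        CustomerName   NVARCHAR(100)  NOT NULL,
        OrderCount     INT            NOT NULL,
        LastOrderDate  DATE           NULL
    );

    INSERT INTO dbo.CustomerOrderSummaryLookup (CustomerID, CustomerName, OrderCount, LastOrderDate)
    SELECT  l.CustomerID,
            MIN(l.CustomerName),
            COUNT(DISTINCT l.OrderNumber),
            MAX(l.OrderDate)
    FROM    dbo.CustomerOrderLineLookup AS l
    GROUP BY l.CustomerID;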

In summary, when thinking about data there is no one-size-fits-all solution. The best solutions are often hybrids that consider how to make the best use of all the available technologies to minimize the need for code while maximizing the efficiency of data creation and utilization. It’s not just about the CRUD – we also need to consider CQRS and WORM.