Break out of the cohort building triangle with caching!

Behold – a triangle! (Pythagoras would approve):

As the requirements for building a cohort grow, so does the complexity. Complexity can come from querying more datasets, more columns or simply larger volumes of data. Complex queries take longer to run. You can reduce runtime with power: more CPUs, faster disks, larger indexes etc.

The goal of the RDMP Cohort Builder is to smash (or at least fracture) this triangle by:

We have already seen how we can reduce complexity by splitting the query. Now let’s look at how RDMP respects your runtime investments with caching.

If a part of your cohort identification configuration takes 2 hours to run, then you should only ever have to do that once! If another data analyst wants to check the results or modify a separate part of the configuration (add more complexity), that shouldn’t mean re-running everything.

Sounds like magic, right? (Pythagoras would not approve.) But for this to work we need to make a few assumptions:

So now we should be able to see how the pieces of the jigsaw fit (could you make a jigsaw with triangular pieces? – Ed). RDMP splits the query into each separate dataset:


For each dataset queried, the SQL is built and the query is run against the database in parallel. Since each query targets only a single dataset, the complexity is low and the DBMS query planner can easily make an optimal plan to return the list of patients.
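As a rough illustration of the idea (not RDMP's actual implementation), the sketch below runs one simple query per dataset in parallel and collects a set of patient identifiers from each. The table names, columns, queries and helper functions are all hypothetical, and SQLite stands in for whatever DBMS you use.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical single-dataset queries; RDMP generates its own SQL.
QUERIES = {
    "prescribing": "SELECT DISTINCT patient_id FROM prescribing WHERE drug = 'x'",
    "hospital": "SELECT DISTINCT patient_id FROM hospital WHERE age >= 50",
}


def build_demo_db(path):
    """Create two small illustrative tables in a SQLite file."""
    con = sqlite3.connect(path)
    con.executescript("""
        CREATE TABLE prescribing (patient_id TEXT, drug TEXT);
        CREATE TABLE hospital (patient_id TEXT, age INTEGER);
        INSERT INTO prescribing VALUES ('p1', 'x'), ('p2', 'y'), ('p3', 'x');
        INSERT INTO hospital VALUES ('p1', 70), ('p2', 60), ('p3', 40);
    """)
    con.commit()
    con.close()


def run_query(path, sql):
    """Run one single-dataset query on its own connection."""
    con = sqlite3.connect(path)
    try:
        return {row[0] for row in con.execute(sql)}
    finally:
        con.close()


def run_all(path, queries):
    """Submit every dataset query at once and gather the identifier sets."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_query, path, sql)
                   for name, sql in queries.items()}
        return {name: future.result() for name, future in futures.items()}


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    build_demo_db(db_path)
    results = run_all(db_path, QUERIES)
```

Each query touches only one table, so there are no joins for the query planner to reason about; combining the identifier sets is left to a later step.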

RDMP records in the cache:

Since each query returns only a single column (the patient identifiers), the cached table can have a primary key, which makes the subsequent container totals incredibly fast to run (zero complexity).
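To make that concrete, here is a minimal sketch that stores each query's identifiers in a single-column table with a primary key and then computes a container total over the cached tables. It assumes containers combine identifier lists with set operations such as UNION and EXCEPT; the table names and the `cache_identifiers` helper are invented for illustration and are not RDMP's cache schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")


def cache_identifiers(con, table, identifiers):
    """Store one query's patient identifiers in a single-column keyed table."""
    # The table name is a trusted, hypothetical constant in this sketch;
    # never interpolate user input into SQL like this.
    con.execute(f"CREATE TABLE {table} (patient_id TEXT PRIMARY KEY)")
    con.executemany(f"INSERT INTO {table} VALUES (?)",
                    [(identifier,) for identifier in identifiers])


cache_identifiers(con, "cache_prescribing", ["p1", "p3"])
cache_identifiers(con, "cache_hospital", ["p1", "p2"])
cache_identifiers(con, "cache_excluded", ["p3"])

# Container total: (prescribing UNION hospital) EXCEPT excluded.
# SQLite evaluates compound SELECTs left to right.
container_sql = """
    SELECT patient_id FROM cache_prescribing
    UNION
    SELECT patient_id FROM cache_hospital
    EXCEPT
    SELECT patient_id FROM cache_excluded
"""
cohort = sorted(row[0] for row in con.execute(container_sql))
```

The container query never touches the original datasets; it only reads small, keyed identifier tables.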

Changes


If a user changes part of the query (e.g. changing the age exclusion to under 60) then the execution engine will regenerate the SQL for every dataset. If the regenerated SQL matches the recorded cached SQL then the cache table is used; otherwise the cache is invalidated and the query is run again.
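The check described above could be sketched roughly as follows. This is a simplified model under stated assumptions: the cache is a plain dictionary, matching is an exact string comparison of the SQL, and `get_identifiers` and `fake_run` are hypothetical names rather than RDMP functions.

```python
def get_identifiers(cache, dataset, sql, run_query):
    """Return (identifiers, cache_hit) for one dataset's regenerated SQL."""
    entry = cache.get(dataset)
    if entry is not None and entry["sql"] == sql:
        # Regenerated SQL matches what was recorded: reuse the cached result.
        return entry["ids"], True
    # No entry, or the SQL changed: discard the stale entry and re-run.
    ids = run_query(sql)
    cache[dataset] = {"sql": sql, "ids": ids}
    return ids, False


calls = []


def fake_run(sql):
    """Stand-in for a slow database query; records each execution."""
    calls.append(sql)
    return {"p1", "p2"}


cache = {}
old_sql = "SELECT patient_id FROM hospital WHERE age >= 50"
new_sql = "SELECT patient_id FROM hospital WHERE age >= 60"

_, hit_first = get_identifiers(cache, "hospital", old_sql, fake_run)   # first run
_, hit_second = get_identifiers(cache, "hospital", old_sql, fake_run)  # unchanged SQL
_, hit_third = get_identifiers(cache, "hospital", new_sql, fake_run)   # edited filter
```

Only the dataset whose SQL changed pays the runtime cost again; every other dataset keeps its cached result.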

About those assumptions...

Let’s start this paragraph with a quote from our favourite ancient Ionian Greek (You googled that, didn’t you? – Ed).

The oldest, shortest words— “yes” and “no”— are those which require the most thought.
― Pythagoras

 

So:

Can I still query two datasets at once (e.g. hospitalised after being prescribed drug x)?
Yes (but that’s a blog for another day)

Can I clear the cache once I load new data?
Yes!

I have more questions!
Yes? More answers are available in the FAQ, or you can post a question in the RDMP GitHub Issues.