How we sped up a Postgresql procedure 50X
What I'm about to explain may seem rudimentary to a DBA but it was over looked by our distinguished engineer. And once you learn the inner workings it's a "duh" kinda moment where you look back and wonder why you didn't see it.
What is a Postgres procedure
Let's start with the basics. There are postgresql functions and procedures. The major differences between the two is that functions returns a result while a procedure does not. In addition to the return differences, you Select a function but Call a procedure.
Diving into the specifics, a function is almost like a function in any other programming language. It takes inputs and can manipulate it or use it in a sql statement.
Adding two numbers that are fed as inputs:
Taking two inputs and inserting into the users table:
Why offset is horrible for large datasets.
Offset is exactly what it sounds like, it starts at the what is set as the offset.
This selects the selects 7 rows but throws away the first 2. This isn't a problem on smaller datasets and can be non-consequential but on larger datasets in the million of rows it's unsustainable and takes longer and longer times.
For example take this semi-pseudo code postgres function:
This selects 100k rows at a time looping until it returns rows less than the batch size then exits. The first several runs aren't that bad, but around the the 6th or 7th loop run you have several seconds added since it has to select select 700k rows and throw away the the first 600k rows. When running this query with nearly 100 million rows it took around 8 hours.
Optimizing the query
The simplest thing that would make it worlds faster is to remove the offset. Easiest way to do that is increment the first number in BETWEEN
What if we took it a step further and removed the limit too? We can do that by setting the upper limit to lower_limit +batch size:
Okay it's looking much better. But we're exiting if there is less rows returned the batch size. While at first glance this appears to be okay, what happens if a row is deleted? We should instead exit when we reach the max upper bound.
Other Potential Solutions
We can instead use the seek method which is a simple where clause combined with a fetch first: