How we sped up a PostgreSQL procedure 50X

 

Introduction

  What I'm about to explain may seem rudimentary to a DBA, but it was overlooked by our distinguished engineer. And once you learn the inner workings it's a "duh" kinda moment, where you look back and wonder why you didn't see it.

What is a Postgres procedure?

  Let's start with the basics. There are PostgreSQL functions and procedures. The major difference between the two is that a function returns a result while a procedure does not. In addition to the return difference, you SELECT a function but CALL a procedure.

  Diving into the specifics, a function is almost like a function in any other programming language. It takes inputs and can manipulate them or use them in a SQL statement.

 

 Adding two numbers that are fed as inputs:

CREATE FUNCTION add_numbers (first_number integer, second_number integer) RETURNS integer
LANGUAGE SQL IMMUTABLE
AS $$
  SELECT first_number + second_number;
$$;
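
Since it's a function, you invoke it with SELECT:

SELECT add_numbers(2, 3);  -- returns 5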

 

Taking two inputs and inserting into the users table:

CREATE OR REPLACE PROCEDURE insert_into_user(username varchar(255), user_id integer)
LANGUAGE SQL
AS $$
  INSERT INTO users VALUES (username, user_id);
$$;
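
And since this one is a procedure, you invoke it with CALL (this assumes the users table's first two columns line up with username and user_id):

CALL insert_into_user('some_user', 1);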


Why offset is horrible for large datasets. 

  OFFSET is exactly what it sounds like: the query skips over however many rows you set as the offset before it starts returning results.

SELECT * FROM artists LIMIT 5 OFFSET 2;

This reads 7 rows but throws away the first 2, returning only 5. That's inconsequential on smaller datasets, but on tables with millions of rows it's unsustainable: every batch has to re-read and discard everything before it, so each one takes longer than the last.
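
To make that concrete, here's a rough sketch against the same hypothetical artists table; the comments describe how many rows the server has to produce, not exact timings:

SELECT * FROM artists LIMIT 5 OFFSET 0;        -- produces 5 rows
SELECT * FROM artists LIMIT 5 OFFSET 1000000;  -- produces 1,000,005 rows and discards the first 1,000,000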

Now take this semi-pseudocode Postgres function, similar to what we were running:

batch_size = 100000
cur_offset = 0
loop
  insert into new_table SELECT * FROM artists WHERE artist_id BETWEEN 1 AND 30000000 LIMIT batch_size OFFSET cur_offset;
  get diagnostics rows_inserted_last = row_count;
  rows_inserted = rows_inserted + rows_inserted_last;
  cur_offset = cur_offset + batch_size;
  if rows_inserted_last < batch_size then
    return;
  end if;
end loop;


This selects 100k rows at a time, looping until a batch comes back with fewer rows than the batch size, then exits. The first several iterations aren't that bad, but by around the 6th or 7th pass you're adding several seconds per batch, since Postgres has to read 700k rows and throw away the first 600k. The waste grows with every pass: with roughly 100 million rows in 100k batches, the loop ends up reading on the order of 50 billion rows in total. When we ran this against nearly 100 million rows it took around 8 hours.
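
For reference, here's a minimal runnable version of that loop as an anonymous plpgsql block. It's a sketch rather than our exact procedure: it assumes new_table already exists with the same columns as artists, and it adds an ORDER BY so successive offsets see a stable ordering.

DO $$
DECLARE
  batch_size integer := 100000;
  cur_offset bigint  := 0;
  rows_inserted bigint := 0;
  rows_inserted_last bigint;
BEGIN
  LOOP
    INSERT INTO new_table
      SELECT * FROM artists
      WHERE artist_id BETWEEN 1 AND 30000000
      ORDER BY artist_id
      LIMIT batch_size OFFSET cur_offset;
    GET DIAGNOSTICS rows_inserted_last = ROW_COUNT;
    rows_inserted := rows_inserted + rows_inserted_last;
    -- every pass re-reads and discards cur_offset rows before inserting anything
    cur_offset := cur_offset + batch_size;
    EXIT WHEN rows_inserted_last < batch_size;
  END LOOP;
END;
$$;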

 Optimizing the query

  The simplest change that makes it worlds faster is to remove the OFFSET. The easiest way to do that is to increment the lower bound of the BETWEEN clause:

batch_size = 100000
lower_limit = 1
loop
  insert into new_table SELECT * FROM artists WHERE artist_id BETWEEN lower_limit AND 30000000 LIMIT batch_size;
  get diagnostics rows_inserted_last = row_count;
  rows_inserted = rows_inserted + rows_inserted_last;
  lower_limit = lower_limit + batch_size;
  if rows_inserted_last < batch_size then
    return;
  end if;
end loop;

 

What if we took it a step further and removed the LIMIT too? We can do that by setting the upper bound to lower_limit + batch_size:

batch_size = 100000
lower_limit = 1
upper_limit = lower_limit + batch_size
loop
  insert into new_table SELECT * FROM artists WHERE artist_id BETWEEN lower_limit AND upper_limit;
  get diagnostics rows_inserted_last = row_count;
  rows_inserted = rows_inserted + rows_inserted_last;
  lower_limit = lower_limit + batch_size;
  upper_limit = lower_limit + batch_size;
  if rows_inserted_last < batch_size then
    return;
  end if;
end loop;

Okay, it's looking much better. But we're still exiting when a batch returns fewer rows than the batch size. At first glance that appears fine, but what happens if rows have been deleted? If, say, 60k of the 100k ids in one range were deleted, that batch returns only 40k rows and the loop stops early even though millions of ids remain. We should instead exit when we reach the maximum upper bound.


batch_size = 100000
max_size = 30000000
lower_limit = 1
upper_limit = lower_limit + batch_size
loop
  insert into new_table SELECT * FROM artists WHERE artist_id BETWEEN lower_limit AND upper_limit;
  get diagnostics rows_inserted_last = row_count;
  rows_inserted = rows_inserted + rows_inserted_last;
  lower_limit = lower_limit + batch_size;
  upper_limit = lower_limit + batch_size;
  if lower_limit > max_size then
    return;
  end if;
end loop;
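
Since the whole point of the post is a procedure, here's roughly what that final version looks like as real plpgsql. It's a sketch under the same assumptions as before (new_table mirrors artists, and artist_id is an integer key topping out around 30 million); the procedure name is made up for illustration.

CREATE OR REPLACE PROCEDURE copy_artists_in_batches()
LANGUAGE plpgsql
AS $$
DECLARE
  batch_size  integer := 100000;
  max_size    bigint  := 30000000;
  lower_limit bigint  := 1;
  upper_limit bigint;
BEGIN
  upper_limit := lower_limit + batch_size - 1;
  LOOP
    -- each pass touches only its own id range, so nothing is read and thrown away
    INSERT INTO new_table
      SELECT * FROM artists
      WHERE artist_id BETWEEN lower_limit AND upper_limit;
    COMMIT;  -- a procedure (unlike a function) can commit between batches
    lower_limit := upper_limit + 1;
    upper_limit := lower_limit + batch_size - 1;
    EXIT WHEN lower_limit > max_size;
  END LOOP;
END;
$$;

You'd kick it off with CALL copy_artists_in_batches();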

Other Potential Solutions

  We can instead use the seek method (keyset pagination): order by the key, remember the last artist_id actually copied, and grab the next batch with a simple WHERE clause combined with a FETCH FIRST. Since the seek never skips over gaps, it's also safe to exit as soon as a batch comes back short:

batch_size = 100000
lower_limit = 0
loop
  insert into new_table SELECT * FROM artists WHERE artist_id > lower_limit ORDER BY artist_id FETCH FIRST batch_size ROWS ONLY;
  get diagnostics rows_inserted_last = row_count;
  rows_inserted = rows_inserted + rows_inserted_last;
  lower_limit = max artist_id copied so far;
  if rows_inserted_last < batch_size then
    return;
  end if;
end loop;
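
As a rough plpgsql sketch of that idea (same hypothetical tables as above), you can track the highest id actually copied instead of assuming the ids are contiguous:

DO $$
DECLARE
  batch_size integer := 100000;
  last_id    bigint  := 0;
  rows_inserted_last bigint;
BEGIN
  LOOP
    -- seek past the last id we copied, then take the next batch in key order
    INSERT INTO new_table
      SELECT * FROM artists
      WHERE artist_id > last_id
      ORDER BY artist_id
      FETCH FIRST batch_size ROWS ONLY;
    GET DIAGNOSTICS rows_inserted_last = ROW_COUNT;
    EXIT WHEN rows_inserted_last < batch_size;
    -- the highest id copied so far becomes the next starting point
    -- (an index on new_table.artist_id keeps this lookup cheap)
    SELECT max(artist_id) INTO last_id FROM new_table;
  END LOOP;
END;
$$;

Because each pass is just an index range scan starting at last_id, the cost per batch stays flat no matter how far into the table you are.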
