I find myself facing a particular dilemma quite often.
Should I keep this rarely used index or drop it to enhance write performance?
If the index isn’t used at all, there’s no problem: just drop it. But if
something occasionally uses the index, it may be necessary to keep it.
That something might be a script generating a monthly report, or some rarely
used back-office feature. Dropping the index would cause such applications
to take forever, due to a seq_scan of the huge table.
In my case, our system processes millions of transactions, and write
performance will at some point become a bottleneck.
Dropping indexes would help me, but would backfire by causing problems a few
times a month, when the indexes are necessary.
Now I am thinking: for these real-time write transactions, it would be awesome
if we could commit at once, and postpone the work of updating the rarely used
indexes until the point where some read query actually needs them. They could
be updated as soon as possible, but not during peak time, allowing the more
important real-time transactions to be prioritized.
Now you are thinking: this can’t be done, because the index is useless if it’s
not in sync, and that’s correct. But wait, we can fix that. Every time we need
to use the index, we simply process the buffer of pending changes to get in
sync before the index is used. Processing the buffer would probably take a
second or two, which would delay the monthly report batch job, but that’s
totally OK, as it’s not a real-time feature and takes many seconds to complete
anyway; without the index in place it would, again, take forever.
In total CPU time, we would not save anything; the amount of work would be the
same. But we would survive write-operation peak times better. We want to
maximize write throughput at peak times, and do the updating of the rarely
used indexes later, when the system load is lower.
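As a rough user-space approximation of the idea (not the in-core feature I’m
proposing), one could stage hot-path writes in an index-free buffer table and
flush it before any query that needs the indexes. The table and function names
below are hypothetical, and this sketch ignores concurrency details a real
implementation would have to get right:

```sql
-- Thin buffer table with no secondary indexes to maintain on the hot path.
-- INCLUDING DEFAULTS copies the serial default, so both tables share the
-- same sequence.
CREATE TABLE Transactions_Buffer (LIKE Transactions INCLUDING DEFAULTS);

-- Hot path: cheap insert, only the buffer heap is touched.
INSERT INTO Transactions_Buffer (Amount, AccountID, CustomerID, ProductID)
VALUES (100, 1, 1, 1);

-- Flush step, called before the monthly report or from a low-load cron job:
-- move buffered rows into the fully indexed table, then empty the buffer.
CREATE OR REPLACE FUNCTION flush_transactions_buffer() RETURNS void AS $$
BEGIN
    LOCK TABLE Transactions_Buffer IN ACCESS EXCLUSIVE MODE;
    INSERT INTO Transactions SELECT * FROM Transactions_Buffer;
    TRUNCATE Transactions_Buffer;
END;
$$ LANGUAGE plpgsql;

-- The report gets in sync first, then uses the indexes as usual:
SELECT flush_transactions_buffer();
SELECT sum(Amount) FROM Transactions WHERE AccountID = 1;
```

The obvious drawback is that every real-time read of recent data must either
call the flush or UNION the two tables, which is exactly the bookkeeping an
in-core solution could hide.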
I don’t know how (or if) this could be implemented in PostgreSQL, but the
recently added CREATE INDEX CONCURRENTLY is in a way similar to what I need:
it builds the index outside of a single ACID transaction, and prevents anyone
from using the index until it is marked as VALID.
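For reference, that validity flag already exists in the system catalog: an
index whose concurrent build failed is left behind with pg_index.indisvalid =
false, and the planner simply ignores it. A lazily maintained index could
perhaps reuse the same mechanism. The Datestamp index below is just an
illustrative example:

```sql
-- A concurrent build does its work outside a single transaction:
CREATE INDEX CONCURRENTLY Index_Transactions_Datestamp
    ON Transactions (Datestamp);

-- The planner ignores any index whose indisvalid flag is false:
SELECT indexrelid::regclass AS index, indisvalid, indisready
FROM pg_index
WHERE indrelid = 'transactions'::regclass;
```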
Maybe a background process could handle the updating of the indexes, like
what we do with the vacuum daemon.
I can see there has been some discussion on this subject before on the mailing
lists:
“Scaling with lazy index updates”
http://archives.postgresql.org/pgsql-performance/2004-07/msg00114.php
But this was a very long time ago; the conditions may have changed since
2004.
ACID must be maintained, and the index must not be used while not in sync. I
only want to delay the index maintenance work a bit, and only be forced to get
in sync when a read query makes it necessary.
Thoughts, anyone?
In the test below, we can see how indexes affect insert speed.
joel@Joel-Jacobsons-MacBook-Pro ~ $ psql
psql (9.2beta1)
Type "help" for help.
joel=# CREATE TABLE Transactions (
joel(# TransactionID serial not null,
joel(# Amount numeric not null,
joel(# AccountID integer not null,
joel(# CustomerID integer not null,
joel(# ProductID integer not null,
joel(# Datestamp timestamptz not null default now(),
joel(# PRIMARY KEY (TransactionID)
joel(# -- No foreign keys in this example
joel(# );
NOTICE: CREATE TABLE will create implicit sequence "transactions_transactionid_seq" for serial column "transactions.transactionid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
CREATE TABLE
joel=# -- No indexes except the primary key:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=454.232..454.232 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=14.762..131.197 rows=100000 loops=1)
Total runtime: 489.204 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_AccountID ON Transactions(AccountID);
CREATE INDEX
joel=#
joel=# -- 1 index:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=739.295..739.295 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=10.160..135.399 rows=100000 loops=1)
Total runtime: 741.141 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_CustomerID ON Transactions(CustomerID);
CREATE INDEX
joel=#
joel=# -- 2 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=1626.515..1626.515 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.739..530.770 rows=100000 loops=1)
Total runtime: 1627.900 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_ProductID ON Transactions(ProductID);
CREATE INDEX
joel=#
joel=# -- 3 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=2161.321..2161.321 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.976..549.794 rows=100000 loops=1)
Total runtime: 2164.164 ms
(3 rows)
