Thoughts on Arbitrary Pagination

Pagination is the act of breaking a data set into multiple pages to limit the amount of data that has to be processed and sent by a server at once. We’re going to be changing how pagination works on crates.io, and I wanted to share some musings about the issues with supporting this as a generic abstraction. While I’m going to be talking about some PostgreSQL internals in this article, the general ideas presented apply to any SQL database.

“Simple” Pagination #

The most common form of pagination uses page numbers. This is done by having numbered page links shown towards the bottom of the UI, and including ?page=5 in the URL. Often there are also links for next page, previous page, first page, and last page. The UI might look like this.

[Screenshot: pagination UI with numbered page links and first/previous/next/last controls]

This form of pagination is implemented on the server by adding an offset to the query. The total is also needed to know how many pages there actually are. For example, the query to get the 5th page of crates might look like this:

SELECT *, COUNT(*) OVER () FROM (
  SELECT * FROM crates
  ORDER BY name ASC
) t
LIMIT 10
OFFSET 40

This is by far the most common form of pagination, but it has a lot of issues. The first is that it’s inconsistent: any rows added to or deleted from pages before the one you’re currently on will shift the results you get back, leading to results being skipped or shown twice. However, the bigger issue for crates.io is that using OFFSET for pagination doesn’t scale.

The issue with OFFSET #

If you don’t care about the technical details of why OFFSET is slow, just know that OFFSET takes linear[1] time. Discussion of alternatives and their implementation starts at “Better Pagination” below.

Paginated data needs to have some consistent ordering. To make that ordering fast, we will always want some index on the data being sorted. The default index type (and the only one that can be used for ordering[2]) is B-Tree. A B-Tree is a generalization of a Binary Search Tree; the differences between them are mostly about the costs of memory and disk access, which aren’t relevant to the discussion at hand. For the sake of simplicity, we’re going to pretend these indexes use perfectly balanced binary search trees, as it doesn’t change the performance characteristics for the purposes of our discussion.

Balanced binary search trees and B-Trees are both very efficient at finding a specific value (or the first value greater or less than some specific value). At each node, we know which subtree holds the values greater than and less than the value at that node. This means that finding all crates published after a given date takes at most log(total_crate_count) steps, as each step cuts the data we have to search in half.
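
As a quick sketch of what that kind of lookup looks like (this assumes a created_at column with a B-Tree index on it, which is hypothetical here):

-- The index lets us binary-search to the first row with
-- created_at > '2019-01-01', then walk forward in sorted order.
SELECT * FROM crates
WHERE created_at > '2019-01-01'
ORDER BY created_at ASC
LIMIT 10;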

But we’re not looking for some specific known value; we’re looking for the nth record in the set. A binary tree offers us no help here! In order to find the nth element, we have to traverse the whole tree starting from the lowest element, counting nodes until we reach the offset[3]. So for the purposes of calculating OFFSET, the only benefit the index gives us is starting from a pre-sorted list. In exchange for that benefit, actually accessing each row is slower[4]. In fact, the difference is substantial enough that the cost of iterating over the index eventually becomes greater than the cost of just sorting the table and iterating over that. We can see this reflected in the query plans and execution times as we increase the offset[5].
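
For reference, the plans below scan an index named index_crates_name_ordering. A minimal equivalent (the actual definition on crates.io may differ) would be:

-- A plain B-Tree index on name, usable for ORDER BY name ASC
CREATE INDEX index_crates_name_ordering ON crates (name);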

[local] sean@cargo_registry=# explain analyze select * from crates order by name asc limit 100 offset 0;

QUERY PLAN
--------------
 Limit  (cost=0.08..32.84 rows=100 width=1044) (actual time=0.056..0.761 rows=100 loops=1)
   ->  Index Scan using index_crates_name_ordering on crates  (cost=0.08..9475.16 rows=28922 width=1044) (actual time=0.055..0.751 rows=100 loops=1)
 Planning Time: 0.978 ms
 Execution Time: 0.785 ms
(4 rows)

[local] sean@cargo_registry=# explain analyze select * from crates order by name asc limit 100 offset 10000;

QUERY PLAN
-------------
 Limit  (cost=3276.14..3308.90 rows=100 width=1044) (actual time=20.237..20.377 rows=100 loops=1)
   ->  Index Scan using index_crates_name_ordering on crates  (cost=0.08..9475.08 rows=28922 width=1044) (actual time=0.010..20.018 rows=10100 loops=1)
 Planning Time: 0.571 ms
 Execution Time: 20.405 ms
(4 rows)

[local] sean@cargo_registry=# explain analyze select * from crates order by name asc limit 100 offset 18000;

QUERY PLAN
--------------
 Limit  (cost=4304.39..4304.44 rows=100 width=1044) (actual time=100.250..100.272 rows=100 loops=1)
   ->  Sort  (cost=4295.39..4309.85 rows=28922 width=1044) (actual time=97.291..99.714 rows=18100 loops=1)
         Sort Key: name
         Sort Method: quicksort  Memory: 29037kB
         ->  Seq Scan on crates  (cost=0.00..3866.77 rows=28922 width=1044) (actual time=0.009..14.576 rows=28922 loops=1)
 Planning Time: 0.121 ms
 Execution Time: 101.942 ms
(7 rows)

Even though this query is nearly instant when fetching early rows, execution time grows quickly as OFFSET increases, and the query eventually becomes unacceptably slow for a high-traffic endpoint. This isn’t going to be a problem for all data sets. Maybe the data you’re paginating over is small enough that you can offset to the end quickly. Maybe nobody ever visits page 1800. But if you’ve got an API that is routinely crawled (which might just be googlebot indexing your site), you’re likely to run into this problem sooner or later.

This can lead to some baffling graphs where the execution time of a query appears to be correlated with how frequently it’s run (top graph is executions per minute, bottom is average execution time – each bar represents 1 hour):

[Screenshot: two graphs of the same query; executions per minute on top, average execution time on the bottom]

In this case the spikes in execution count are from the API being crawled, meaning we’re getting more requests where we have a high OFFSET, causing an increase in the average execution time.

Better Pagination #

Now that we’ve seen why traditional pagination is slow, let’s talk about what we can do instead. The technique we’re going to discuss has a lot of names. I’ve seen it called seek based pagination, cursor pagination, keyset pagination, and more. I’m going to refer to it as seek based pagination, as I think it’s the most appropriate name[6].

Regardless of the name, the basic idea is the same. Instead of having numbered pages, we only have links for next and previous page. In some contexts (such as infinite scroll or “load more” pagination), we don’t even need a previous page link. The UI might look like this:

[Screenshot: pagination UI with only “Previous” and “Next” links]

The basic idea behind seek based pagination is that instead of having numerical pages, your next/previous page links reference the first/last result in the set on the current page. So instead of “Next Page” going to ?page=5, it would instead go to ?after=643, where 643 is the ID of the last thing displayed on the page.

To implement this form of pagination, your query needs to change slightly. You still use LIMIT to make sure you are only retrieving one page of records. However, with seek pagination, instead of using OFFSET, you will filter the records to only include those after some known value. So if you were paginating by id, it’d be WHERE id > {last id on previous page}. For name it’d be WHERE name > {last name on previous page}.
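
As a minimal sketch, the single-column version of that query looks like this (the placeholder value comes from the previous page’s last row):

-- Seek pagination: no OFFSET, just a filter on the last value seen
SELECT * FROM crates
WHERE name > 'last name on previous page'
ORDER BY name ASC
LIMIT 10;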

Many implementations I’ve seen make this field the actual value you’re sorting by, but I don’t think that’s necessary. You can substitute name > 'name of crate 1234' with name > (SELECT name FROM crates WHERE id = 1234) with virtually no change in performance characteristics, as long as the column you’re searching by is the primary key or has a unique index. I do think there are a few good use cases for taking input other than the primary key[7], but in general, taking the row’s public identifier as your input is the simplest and most flexible solution.
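
Concretely, a sketch of that take-an-id-as-input variant:

-- The client sends the id of the last row it saw (1234 here),
-- but we still sort and filter by name.
SELECT * FROM crates
WHERE name > (SELECT name FROM crates WHERE id = 1234)
ORDER BY name ASC
LIMIT 10;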

If we only ever sorted by name, we’d be pretty much done here. Our abstractions over pagination mostly exist to calculate the offset and grab the total from the query. Seek based pagination doesn’t need either of those things, so we could possibly remove that abstraction entirely. Alternatively, it would be fairly simple to write an abstraction over “here’s a column, a table, and an id; please paginate”. Either way, the implementation is straightforward. This is where most of the literature I’ve read on this subject ends.

Supporting More Complex Ordering #

The first issue comes from the fact that we sometimes sort by multiple columns. For example, on the crates search page, we always sort by “name exactly matches search query” before anything else, since we want any crates that are exactly what you searched for shown first. In addition, we often want to sort by values that contain duplicates (download count, relevance), so we need to add a final sort expression to make sure our results are deterministic.

Most of the places we use pagination today actually have non-deterministic ordering, which we need to fix. Some databases will refuse to execute a query without deterministic ordering, but it’s surprisingly easy to get this wrong by accident. With OFFSET, your backend will generally (but is not guaranteed to) do the right thing. But with seek pagination, if your ordering has duplicates, you might end up with a “Next Page” link that displays the same results as the previous one.
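
The fix is to end the order clause with a unique column. A sketch of making a download-count ordering deterministic:

-- downloads contains duplicates, so break ties with the unique id
SELECT * FROM crates
ORDER BY downloads DESC, id DESC
LIMIT 10;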

So we need something that can handle multiple columns. For crates.io, we use PostgreSQL. PG supports composite values, which we can use to compare multiple columns without too much additional complexity. Our query to sort by all-time downloads, with a search term, might look like this:

WHERE (
  canon_crate_name(name) = canon_crate_name('foo'),
  crates.downloads,
  id
) < (SELECT
    canon_crate_name(name) = canon_crate_name('foo'),
    crates.downloads,
    id
  FROM crates WHERE id = 1234 LIMIT 1
)
ORDER BY
  canon_crate_name(name) = canon_crate_name('foo') DESC,
  crates.downloads DESC,
  id DESC
LIMIT 100

This approach is more complex, but it still seems fairly reasonable, whether implemented as a generic abstraction or written by hand. But it still has problems. Using a tuple means we have to use the same comparison operator for all of our sort expressions, which can’t support order clauses that mix directions. So we can’t use row comparison here; we need to fall back to writing out each condition individually. Our query ends up looking like:

WITH order_row AS (
  SELECT
    canon_crate_name(name) = canon_crate_name('foo') AS matches,
    downloads
  FROM crates
  WHERE id = 1234
  LIMIT 1
)
SELECT ... FROM crates ...
WHERE
  (canon_crate_name(name) = canon_crate_name('foo')) <
  (SELECT matches FROM order_row LIMIT 1)
OR (
    (canon_crate_name(name) = canon_crate_name('foo')) =
    (SELECT matches FROM order_row)
  AND
    downloads < (SELECT downloads FROM order_row)
) OR (
    (canon_crate_name(name) = canon_crate_name('foo')) =
    (SELECT matches FROM order_row)
  AND
    downloads = (SELECT downloads FROM order_row)
  AND
    id > 1234
)
ORDER BY
  canon_crate_name(name) = canon_crate_name('foo') DESC,
  downloads DESC,
  id ASC
LIMIT 100

I’ve swapped the sort direction of id for the sake of example. Now we can support arbitrary directions, since each part of the WHERE clause is written out explicitly. With something this complex, we must be done, right? Unfortunately, even this is still not enough.

Don’t Forget About Null #

Every example I’ve given so far falls apart as soon as NULL is involved. Even the articles I’ve seen that discuss multi-column ordering assume that no value is NULL. And NULL completely throws a wrench into what we’ve written so far. The issue isn’t even that the ordering will be wrong: records with NULL in one of the sort expressions won’t show up at all.

In SQL, NULL behaves much like NaN does for floats. Passing NULL to either side of a comparison returns NULL: anything = NULL is NULL, and anything > NULL is NULL. The same is mostly true for boolean operators (TRUE AND NULL is NULL), except when the other operand already decides the result (FALSE AND NULL is FALSE, and TRUE OR NULL is TRUE). Since NULL is considered falsey when filtering, any record with NULL in any column considered by the order clause will not get returned.
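
A few quick queries demonstrate the propagation:

SELECT 1 = NULL;       -- NULL, not false
SELECT 1 > NULL;       -- NULL
SELECT true AND NULL;  -- NULL
SELECT false AND NULL; -- false: the result is already decided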

This gets even more complicated if you want to support arbitrary order clauses, which means dealing with NULLS FIRST and NULLS LAST (for reference, ASC is equivalent to ASC NULLS LAST, and DESC is equivalent to DESC NULLS FIRST).

Luckily, PostgreSQL has some operators that help us deal with at least a bit of this complexity. We can replace all uses of = with IS NOT DISTINCT FROM, which behaves exactly like = except that it doesn’t propagate NULL. The comparison operators have no such variant, so we’ll need to check for NULL manually to get the correct behavior (note that 1 > NULL IS NOT DISTINCT FROM NULL > 1, since both are NULL). The predicate we need for each combination of direction and null ordering is:

| Direction | Nulls | Operator |
| --------- | ----- | -------- |
| ASC | FIRST | x IS NOT NULL AND x > y IS NOT FALSE |
| ASC | LAST | x IS NULL OR x > y |
| DESC | FIRST | x IS NOT NULL AND x < y IS NOT FALSE |
| DESC | LAST | x IS NULL OR x < y |

So our final, actually working query would look something like:

WITH order_row AS (
  SELECT
    canon_crate_name(name) = canon_crate_name('foo') AS matches,
    downloads
  FROM crates
  WHERE id = 1234
  LIMIT 1
)
SELECT ... FROM crates ...
WHERE
  (canon_crate_name(name) = canon_crate_name('foo')) <
  (SELECT matches FROM order_row LIMIT 1)
OR (
    (canon_crate_name(name) = canon_crate_name('foo')) =
    (SELECT matches FROM order_row)
  AND (
      downloads IS NOT NULL
    AND
      downloads < (SELECT downloads FROM order_row) IS NOT FALSE
  )
) OR (
    (canon_crate_name(name) = canon_crate_name('foo')) =
    (SELECT matches FROM order_row)
  AND
    downloads IS NOT DISTINCT FROM (SELECT downloads FROM order_row)
  AND
    id > 1234
)
ORDER BY
  canon_crate_name(name) = canon_crate_name('foo') DESC,
  downloads DESC,
  id ASC
LIMIT 100

Of course, that query was written manually, so I can omit null checks for columns I know are not null, inline values that I know can be inlined, and use a CTE for readability. However, the goal here is to generate this from an abstraction that handles arbitrary order clauses, so the actual generated query is a good bit more verbose. That query, while much harder to read, will have the same performance characteristics[8].

This is no longer in the realm of something you can reasonably write by hand. A generic abstraction is absolutely needed, especially with the branching, interacting orderings used by crates.io’s search.

Could This Be Easier? #

I was really surprised that I ended up needing to write this by hand. What I’d really like is a function that takes some arbitrary ORDER BY clause and returns a value that implements the comparison operators accordingly, letting me write:

WHERE ORDERING(
  canon_crate_name(name) = canon_crate_name('foo') DESC,
  downloads DESC,
  id ASC
) > (SELECT ORDERING(
    canon_crate_name(name) = canon_crate_name('foo') DESC,
    downloads DESC,
    id ASC
  ) FROM crates WHERE id = 1234 LIMIT 1)

This is the sort of feature that the authors of PostgreSQL seem to always think of before I know I need it, so I was surprised to find that nothing similar already exists. Unfortunately, this would be hard to implement without native support, since proper index usage is critical here[9].

A big part of the issue here is that there’s very little tooling for this form of pagination. Very few pagination libraries support it at all, and those that do often handle only single-column ordering or treat it as an afterthought. As time goes on, the need for more scalable pagination is only going to become more common, and it needs better support across the board. Arguably, it should even be the default.

Thankfully with Diesel, an implementation of this isn’t too unreasonable. Since everything is represented in the type system, we just need an order_by function that takes one of Asc, Desc, NullsFirst, or NullsLast (for simplicity of implementation, this will probably require explicitly calling .asc() instead of passing a column directly). There’s still room for logic errors to sneak in, but we can lean on the type system quite a bit to help keep the complexity manageable.

If you look closely, you might notice that a lot of folks who serve large data sets and use numbered pagination… actually don’t. For example, Google still has numbered pages, but they’ll cut you off long before you get to the point where there would be performance issues (they only serve about 1,000 results total per query).

Seek based pagination isn’t without its drawbacks. One thing that is much harder with this form of pagination is detecting the first and last pages. You could detect that you’re on the last page by loading N+1 records. Detecting the first page could be done similarly, but I don’t think either of those is worth the complexity. If you got any pagination parameters at all, you can assume you’re not on the first page. If you got per_page records back, you’re not on the last page. This means that if your total number of records is divisible by your page size, you’ll have an empty last page. And if someone clicks “Previous Page” from page 2, they might see a clickable “Previous Page” link that goes nowhere (but do you really even need “Previous Page”? Browsers have a back button). Neither of these comes up in practice for human users often enough to be concerning. Bots aren’t crawling backwards, and they can deal with an empty last page.
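
For completeness, a sketch of that N+1 approach (paginating by id, with a hypothetical page size of 10):

-- Fetch per_page + 1 rows; if all 11 come back, a next page exists,
-- and we display only the first 10.
SELECT * FROM crates
WHERE id > 1234
ORDER BY id ASC
LIMIT 11;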

If you’re paginating data in 2019, consider whether you really need numbered page links. Are they actually useful? There are absolutely some use cases where they’re necessary, but I think these are mostly niche, typically internal systems that can afford the performance tradeoffs. Often, when folks go to a numbered page directly, they’re using it as a proxy for “seek X% into the data”, which is something we can provide better UI for, and implement efficiently in ways other than OFFSET.


  1. It’s arguably super-linear, but Big-O isn’t really helpful when the changes in performance characteristics are based on in-memory vs on-disk vs disk seeking. Just know that traversing an index linearly is much more expensive than traversing a table linearly, and how much more expensive is based on configuration and hardware. 

  2. Technically BRIN indexes support the operators that would be used for ordering, but they aren’t useful for ORDER BY clauses. These indexes store the minimum and maximum values for each range of heap pages, so some ranges can be skipped when filtering. This means they will only be used when those operators appear directly in a WHERE clause. Though a BRIN index combined with the methods described in this post would create an interesting scenario where fetching the last page of a result set is nearly instant, but the first page is O(n log(n)) over the size of the table. 

  3. You might think we could work around this by having each node store how many children it has. But that’s not a question that can be answered usefully for a database, since different transactions have different answers to it. It’s not just a question of traversal; we also have to check whether each node is actually visible to our transaction. For more details, see “why counting in PostgreSQL is slow”. 

  4. The details of how much slower this is depends on your shared buffer cache size, and the storage hardware you’re using. If you’re using a traditional disk drive, random access of rows potentially means many extra disk seeks, which is the absolute slowest way to access data on the same physical machine. If you’re using a solid state drive, random page access is going to be the same cost as sequential page access, assuming you would have read both from disk otherwise (which is not necessarily true). 

  5. Depending on your local configuration and what type of storage you’re using, the exact point at which the query planner stops using the index may not be the actual point that the query becomes faster from a seq scan. If you’re using an SSD, setting random_page_cost to 1.0 will make it much more accurate in that prediction. Even if you’re using an SSD though, you will still see a linear increase in time based on the value of OFFSET, and there is still eventually going to be a point where it’s faster to not use the index at all. If you’re at the point where OFFSET is adding tens of milliseconds to your query, it’s time to look at alternatives regardless of whether your index can be used or not. 

  6. Keyset based pagination implies that this is being used only when ordering by the primary key, which is perhaps a very common case, but not relevant to the strategy as a whole, nor does that situation apply to me. Cursor based pagination… well, cursors are a thing in SQL, and if you’re writing a desktop application with a direct database connection, they’re probably the best way to do pagination. But using cursors requires that the same database connection be used for all subsequent requests, which doesn’t really work for web apps served over HTTP. Unfortunately, “cursor” is a bit of an overloaded term in this case. 

  7. For example, an API endpoint meant to give the most recent records/changes would benefit from being able to provide a timestamp instead of an ID, so clients can tell you the last time they asked without having to track the last ID they got back. 

  8. Technically it’s marginally slower, since each of those subselects is going to be executed separately. Keep in mind that there’s no round trip time here, though, and the actual execution time of “select 1 row by its primary key” is around 10 microseconds. So the query is roughly .05 ms slower than the CTE version, and will scale roughly linearly with the number of joins in the FROM clause. That seems acceptable. 

  9. Yes, I’m aware that the first piece of the order by clause I’ve used in every example will prevent index usage. It’s not relevant to the point I’m making. 

 