
Postgres Full Text Search vs the rest

2022-10-14


Postgres is one of the best F/OSS databases partly because of its immense feature-set.

One of my favorite Postgres features is Full Text Search (FTS). Search is a common requirement of applications. Well-known search engines like Solr and ElasticSearch are often a first choice, but with Postgres in your stack you've got a great chance for Pareto improvement at low complexity cost.

Many projects function just fine with Postgres Full Text Search and other built-in extensions like trigram search (pg_trgm). GitLab's blog has a great article on their use of Trigram indices to speed up search.

Search Engines: The Next Generation

These days, the debate isn't just Postgres versus Solr or ElasticSearch; a new generation of F/OSS search engines has arrived. To name a few:

  • SQLite FTS
  • TypeSense
  • MeiliSearch
  • OpenSearch

Well OK, some of these engines aren't quite new; SQLite is stable and has supported the search use-case for a while and OpenSearch is a fork of ElasticSearch. While not new, both are worth comparing as choices in front of engineers today (spoiler alert: we're going to compare them).

These search engines are all impressive, but it's hard to evaluate these options in a VACUUM.

If Postgres is already your main database, then it's probably easier to integrate than anything else. But would it be better to take on more complexity by choosing a dedicated search engine?

While search is an endlessly complicated problem and all of these tools are configurable, a comparison of minimal setup and tuning should yield some interesting insights.

Before we get into it though, a brief primer on what "Full Text Search" means in Postgres land.

A whirlwind tour through FTS in Postgres

If we want to compare other solutions to Postgres FTS, we'd better review how to set up and use Postgres FTS!

Wrangling your existing data

Imagine you have some information to search through, stored in your main database.

Maybe a movies table with a structure like this:


create table movies (
  id bigint primary key generated by default as identity,
  title text not null,
  original_title text not null,
  overview text not null,
  created_at timestamptz not null default now()
);

While you're inserting data you might perform INSERTs like this one:


insert into movies (
  title,
  original_title,
  overview
)
values
(
  'Avengers: Age of Ultron',
  'Avengers: Age of Ultron',
  'When Tony Stark tries to jumpstart a dormant peacekeeping program, things go awry and Earth’s Mightiest Heroes are put to the ultimate test as the fate of the planet hangs in the balance. As the villainous Ultron emerges, it is up to The Avengers to stop him from enacting his terrible plans, and soon uneasy alliances and unexpected action pave the way for an epic and unique global adventure.'
);

Basic searching with select and like

With nothing but the raw data in your database, we can actually do some basic searching using select and comparison operators like like.


select
  *
from movies
where title like '%Avengers%' or overview like '%Avengers%';

The LIKE operator as used in the query above matches only the exact, case-sensitive term "Avengers", but it will match it in any part of a given row's title or overview.

We can hack around this basic functionality and make it a bit more robust, as sketched below.
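For example, a "v0" might use the case-insensitive ilike operator and cover all three text columns (a minimal sketch):

select
  *
from movies
where title ilike '%avengers%'
   or original_title ilike '%avengers%'
   or overview ilike '%avengers%';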

Our "v0" would be better, but using like isn't quite what people mean when they ask for robust full text search. Not a bad first version, but Postgres can do much better.

Robust Full Text Search with Postgres

Postgres has built-in Full Text Search which we can use instead of cobbling together our own with like.

Luckily for us, the Postgres FTS documentation is quite extensive, containing lots of information on the all-important Postgres search data types: tsvector and tsquery.
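To make those types concrete: to_tsvector normalizes text into a sorted list of lexemes with positions, and websearch_to_tsquery turns a user's query string into a match expression. For example:

select to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2

select websearch_to_tsquery('english', 'quick fox');
-- 'quick' & 'fox'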

State of the art search engines like Solr offer a range of features to make searches work well:

  • Stemming ("jumped", "jumping" -> "jump")
  • Lemmatization ("better" -> "good")
  • Result Ranking configuration
  • Search by edit distance
  • Vector based similarity (which you might use for a basic recommendation system)

Postgres FTS does not do all of the things a purpose-built search engine can do, but Postgres can produce very good results with decent performance.

A working Postgres FTS setup

Integrating Full Text Search into our imaginary movies table is as simple as executing the following:


-- Add a generated column that contains the search document
alter table movies
add column fts_doc_en tsvector
generated always as (
  to_tsvector('english', title || ' ' || original_title || ' ' || overview)
) stored;

-- Create a GIN index to make our searches faster
create index movies_fts_doc_en_idx
on movies
using gin (fts_doc_en);

Postgres supports even more features like facilities for ranking search results but for now we'll simply use title, original_title, and overview as they are.


select
  *
from movies
where fts_doc_en @@ websearch_to_tsquery('english', 'Avengers');
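As for the ranking facilities mentioned above, a minimal sketch: ts_rank scores each match against the query, so ordering by it surfaces the most relevant rows first.

select
  title,
  ts_rank(fts_doc_en, websearch_to_tsquery('english', 'Avengers')) as rank
from movies
where fts_doc_en @@ websearch_to_tsquery('english', 'Avengers')
order by rank desc
limit 10;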

Comparable FTS engines

Now that we've got a grasp on what FTS is and how it's used in Postgres, let's discuss a few of the alternative solutions Postgres can be compared to.

As you might expect, this isn't an exhaustive list of FTS engines; there are many more that we're not covering here (if you've got another search engine you'd like to see featured, file an issue).

SQLite FTS

SQLite is arguably the most widely deployed database in the world. SQLite is a library that produces and manages single file (or in-memory) databases that run on phones, planes, laptops, kiosks, Raspberry Pis, and everything in between.

Much like Postgres, SQLite is an excellent F/OSS project that has adopted the ability to perform Full Text Search.

The SQLite FTS documentation is extensive, and the FTS subsystems have undergone a few version upgrades -- FTS 3/4 and SQLite FTS 5 are both options.

For our experiments we'll use SQLite FTS5.
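To give a flavor of the setup (a minimal sketch, assuming a movies table mirroring our Postgres one), FTS5 works through a virtual table:

-- create an FTS5 virtual table over the searchable columns
create virtual table movies_fts using fts5(title, original_title, overview);

-- copy in the rows we want to search
insert into movies_fts (title, original_title, overview)
select title, original_title, overview from movies;

-- 'match' queries the full-text index
select title from movies_fts where movies_fts match 'avengers';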

TypeSense

TypeSense is a relatively new project that focuses on lightning fast, typo-tolerant open source search. Typesense boasts a high performance C++ codebase, promising nearly instant search results.

TypeSense provides many demos with example datasets available for query, which you can try out.

Typesense quite ambitiously brands itself as the "Algolia" and "Elasticsearch" alternative, offering to solve your search problems without requiring a PhD.
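For a feel of the developer experience, here's roughly what a search looks like with the official JavaScript client (a sketch; the movies collection, connection details, and API key are placeholders):

import Typesense from 'typesense'

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz', // placeholder
})

// typo-tolerant search across title and overview
const results = await client
  .collections('movies')
  .documents()
  .search({ q: 'suprman', query_by: 'title,overview' })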

MeiliSearch

MeiliSearch is a new search engine written from the ground up in Rust, which also promises lightning fast searches, flexibility, and typo tolerance.

MeiliSearch has extensive documentation, boasts an easy to use yet extendable architecture, and is well supported by a large community and team.
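As a sketch of the client API (the movies index and host are placeholders), a typo-tolerant search is a couple of lines:

import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700' })

// typo tolerance is on by default, so 'suprman' can still find Superman
const { hits } = await client.index('movies').search('suprman')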

OpenSearch

AWS OpenSearch is AWS's answer to ElasticSearch's license changes. While opinions are varied, ElasticSearch built a great (formerly Apache 2.0 licensed) piece of software, and the AWS fork represents at least a stale measure of ElasticSearch's capabilities.

OpenSearch is the same as ElasticSearch (as of the fork), but since its license is much more permissive, it's the easy choice as a comparison to Postgres.

In production make sure to evaluate ElasticSearch versus OpenSearch (or other solutions) sufficiently for your use case.
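For reference, a query through the official JavaScript client looks roughly like this (a sketch; the movies index and node URL are placeholders):

import { Client } from '@opensearch-project/opensearch'

const client = new Client({ node: 'http://localhost:9200' })

// a basic multi_match query across our movie fields
const response = await client.search({
  index: 'movies',
  body: {
    query: {
      multi_match: { query: 'world war', fields: ['title', 'overview'] },
    },
  },
})
console.log(response.body.hits.hits.map((h) => h._source.title))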

Testing our FTS solutions

Well, how do we test the performance of these disparate, complex systems on an infinitely redefinable, hard-to-solve problem? By sticking to the use cases.

Any search engine has two jobs:

  • Ingest information (usually referred to as "documents")
  • Return documents that match user queries

Ingesting information can be important, but "search engine performance" usually refers to the second step - taking queries and outputting results.

That is to say, the primary concern is: assuming a cluster already has the proper data loaded onto it, how fast can it deliver the relevant results for a user-submitted query?

We'll focus on the end-user-centric meaning of performance here (query speed).

Getting a dataset

Hugging Face has a 32MB movie data set. We're interested in the following columns in this data set:

  • title
  • original_title
  • overview

Postgres is capable of loading CSVs, but since we'll be using other search engines as well, let's convert to a format that's much easier to use and process: Newline Delimited JSON (NDJSON).

After a few lines of code in a csv2ndjson.mjs script, we have a movies.ndjson.json file full of JSON documents that we can easily ingest into any search engine (or other database, for that matter!).
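The conversion itself only takes a handful of lines; here's a minimal sketch of what such a script can look like (assuming the csv-parse package; file names are illustrative):

// csv2ndjson.mjs: convert the CSV into Newline Delimited JSON
import { createReadStream, createWriteStream } from 'node:fs'
import { parse } from 'csv-parse'

const out = createWriteStream('movies.ndjson.json')
const parser = createReadStream('movies.csv').pipe(parse({ columns: true }))

for await (const record of parser) {
  // one JSON document per line
  out.write(JSON.stringify(record) + '\n')
}
out.end()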

Inserting the data

Each system ingests data slightly differently, but with a little scripting we can paper over the differences.

The code is organized around a scheme of "drivers", which are minimal ES6 modules.

Here's an example of the ingest function used for Postgres (with slonik as our DB driver):


// Ingesting search documents
async function ingest({ document, pool }) {
  // Ignore rows without proper release dates
  if (document.release_date.trim().length === 0) {
    return
  }

  await pool.query(sql`
    insert into movies (
      id,
      title,
      original_title,
      overview,
      released_at
    )
    values
    (
      ${document.id},
      ${document.title},
      ${document.original_title},
      ${document.overview},
      ${new Date(document.release_date).toISOString()}::timestamptz
    )
    on conflict (id) do update set
      title=${document.title},
      original_title=${document.original_title},
      overview=${document.overview},
      released_at=${new Date(document.release_date).toISOString()}::timestamptz
    ;
  `)
}

The repo contains the complete code for Postgres as well as the other search engines, written in much this same fashion.

Picking a set of queries

This part is somewhat arbitrary: I've put together a list of queries that one might plausibly search for in a movie catalog:


"super hero"
"superhero"
"superman"
"suprman"
"love"
"world war"
"spy"
"romance"
"comedy"
"awakening"

Thanks to a boingboing.net article from 2018, we know there are some words that show up quite often in movie titles, and those make good search terms.

The data set isn't very big, but doing all these searches in quick succession should be enough to get a realistic performance baseline.

Running the queries

Similar to ingesting data, all the search engines take slightly different input for queries, so we'll change our simple string queries into whatever the relevant search engine expects to receive.

Here's what that script looks like for Postgres:


// Querying search phrases
async function query({ phrase, pool }) {
  const results = await pool.query(sql`
    select id, title
    from movies
    where fts_doc_en @@ websearch_to_tsquery('english', ${phrase}::text)
  `)
  const ids = results.rows.map((r) => r.id)
  return ids
}

No surprises there, just like the Postgres FTS example from earlier.

Automating our toil

Once we've combined our scripting magic with some Makefile incantations, running the queries produces output like this:


TIMING=true FTS_ENGINE=pg make query
[info] importing driver from [./pg.mjs]
[info] finished initializing driver [pg]
[info] processing lines in [/path/to/pg-fts-benchmark/search-phrases.ndjson.json]...
[timing] phrase [super hero]: returned [34] results in 3.132471.ms
[timing] phrase [superhero]: returned [86] results in 1.180798.ms
[timing] phrase [superman]: returned [47] results in 0.912615.ms
[timing] phrase [suprman]: returned [0] results in 0.781712.ms
[timing] phrase [love]: returned [5417] results in 19.088668.ms
[timing] phrase [world war]: returned [834] results in 2.902097.ms
[timing] phrase [spy]: returned [349] results in 1.497892.ms
[timing] phrase [romance]: returned [630] results in 1.91661.ms
[timing] phrase [comedy]: returned [1213] results in 2.534538.ms
[timing] phrase [awakening]: returned [210] results in 2.443748.ms
[info] successfully processed [10] lines
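Those [timing] lines come from a thin wrapper around each driver's query function. Roughly speaking (a sketch with illustrative names), the harness does something like this:

// Time a single driver query with a high-resolution timer
import { performance } from 'node:perf_hooks'

async function timedQuery({ phrase, driver }) {
  const start = performance.now()
  const ids = await driver.query({ phrase })
  const elapsedMs = performance.now() - start
  console.log(`[timing] phrase [${phrase}]: returned [${ids.length}] results in ${elapsedMs}.ms`)
  return ids
}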

Right off the bat we can see that Postgres is plenty quick! Some results are intuitive: as you might expect, "love" is a common word in movie overviews, so it returns by far the most matches.

Results

With the ingest and query machinery defined for the search engines, it's easy to get some results.

What does the data tell us?

There's a lot of tuning left undone here, but a few points stand out right away:

  • Even when consuming similar content, engines can produce different results, but generally ratios between queries on the same engine should be consistent.
  • Postgres FTS is quite close performance-wise to many other solutions, at least in their default configuration.
  • Only Typesense and MeiliSearch properly handled misspellings (the "suprman" query).
  • Typesense was relatively strict with matches compared to other engines.
  • OpenSearch was very fast with ingest, but the default configuration doesn't index misspellings.
  • In-memory SQLite is by far the fastest, and PG isn't too far behind for this small data set.

Feedback from the other products

Since this is a benchmark, we felt it was fair to give the other products an opportunity to share their feedback.

Meilisearch

I agree that using PG in simple use cases is an excellent way to be pragmatic. Unfortunately, PG will never be able to offer you a search experience like the one you could have with Meilisearch. Meilisearch has perfect management of typos and searches by prefix, allowing a search at each keystroke. Enhanced relevancy with rules such as the number of words present, the importance of the attribute, the proximity of the terms searched in the document, and custom parameters. The ability to mix textual, geographic, and facet searches. Meilisearch automatically understands almost all languages with the possibility of having synonyms and stop words. Meilisearch has an incredible performance even on large volumes of data, and no backend proxy is needed, thanks to complete API Key management.

Quentin de Quelen, Meilisearch CEO

Typesense

Some more context around Typesense's behavior: how loose or exact matches are done are configurable in Typesense. For eg, you can configure the number of typos that are accounted for (num_typos), and when typo tolerance should kick-in, only if a configurable amount of results are not found (typo_tokens_threshold). For multi keyword searches, you can configure Typesense to expand the search by dropping some of the keywords, until at least a configurable number of search results are found (drop_tokens_threshold). We've chosen specifically to keep the search tight because we received feedback from users that some results were confusing if we let the defaults be too loose.

Jason Bosco, Typesense CEO

Wrap-up

While this certainly isn't "big data", it looks like Postgres has very much held its own, even without much wrangling or tuning. Postgres is clearly more than capable of delivering good enough search for this very basic case.

Special purpose systems are expected to out-perform a more general system like Postgres, but the result quality and search speed that Postgres provides are likely to be good enough for many use cases, and Postgres is "complexity neutral" -- no new systems needed!
