BioHacker News | Understanding the BM25 full text search algorithm

▲Understanding the BM25 full text search algorithm(emschwartz.me)

282 points by rrampage 1 day ago | 9 comments

▲DavidPP 16 hours ago

We use https://typesense.org/ for regular search, but it now has support for doing hybrid search, curious if anyone has tried it yet?

▲kkielhofner 13 hours ago

I've used it for hybrid search and it works quite well.

Overall I'm really happy to see Typesense mentioned here.

A lot of the smaller scale RAG projects, etc you see around would be well served by Typesense but it seems to be relatively unknown for whatever reasons. It's probably one of the easiest solutions to deploy, has reasonable defaults, good docs, easy clustering, etc while still be very capable, performant, and powerful if you need to dig in further.

▲hubraumhugo 18 hours ago

Given the recent advances in vector-based semantic search, what's the SOTA search stack that people are using for hybrid keyword + semantic search these days?

▲noduerme 16 hours ago

A generic search strategy is so different from something you want to target. The task should probably determine the tool.

So I don't know the answer, but I was recently handed about 3 million surveys with 10 free-form writing fields each, and tasked with surfacing the ones that might require action on the part of the company. I chose to use a couple of different small classifier models, manually strip out some common words based on obvious noise in the first 10k results, and then weight the model responses. It turned out to be almost flawless. I would NOT call this sort of thing "programming", it's more just tweaking the black-box output of various different tools until you have a set of results that looks good for your test cases. (And your client ;)

All stitching together small Hugging Face models running on a tiny server in nodejs, btw.

▲keeeba 8 hours ago

Nice, also find small classifiers work best for things like this. Out of interest, how many, if any, of the 3million were labelled?

Did you end up labelling any/more, or distilling from a generative model?

▲emschwartz 18 hours ago

Most of the commercial and open source offerings for hybrid search seem to be using BM25 + vector similarity search based on embeddings. The results are combined using Reciprocal Rank Fusion (RRF).

The RRF paper is impressive in how incredibly simple it is (the paper is only 2 pages): https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

▲softwaredoug 13 hours ago

A warning that RRF is often not Enough, as it can just drag a good solution down towards the worse solution :)

https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough

▲emschwartz 12 hours ago

Ah, that's great! Thanks for sharing that.

I had actually implemented full text search + vector search using RRF but I kept it disabled by default because it wasn't meaningfully improving my results. This seems like a good hypothesis as to why.

▲TeenGirlza17 17 hours ago

[flagged]

▲softwaredoug 16 hours ago

My opinion is people need to not focus on one stack. But be prepared to use tools best for each job. Elasticsearch for BM25 type things. Turbopuffer for simple and fast vector retrieval. Even redis to precompute results for certain queries. Or certain extremely dynamic attributes that change frequently like price. Combine all these in a scatter/gather approach.

I say that because almost always you have a layer outside the search stack(s) that ideally can just be a straightforward inference service for reranking that looks most like other ML infra.

You also almost always route queries to different backends based on an understanding of the users query. Routing “lookup by ID” to a different system than “fuzzy semantic search”. These are very different data structures. And search almost always covers very broad/different use cases.

I think it’s an anti pattern to just push all work to one system. Each system is ideal for different workloads. And their inference capabilities won’t ever keep pace with the general ML tooling that your ML engineers are used to. (I tried with Elasticsearch Learning to Rank and its a hopeless task.)

(That said, Vespa is probably the best 'single stack' that tries to solve a broad range of use-cases.)

▲d4rkp4ttern 16 hours ago

In the Langroid[1] LLM library we have a clean, extensible RAG implementation in the DocChatAgent[2] -- it uses several retrieval techniques, including lexical (bm25, fuzzy search) and semantic (embeddings), and re-ranking (using cross-encoder, reciprocal-rank-fusion) and also re-ranking for diversity and lost-in-the-middle mitigation:

[1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid

[2] DocChatAgent Implementation - https://github.com/langroid/langroid/blob/main/langroid/agen...

Start with the answer_from_docs method and follow the trail.

Incidentally I see you're the founder of Kadoa -- Kadoa-snack is one of favorite daily tools to find LLM-related HN discussions!

▲khaki54 16 hours ago

We're doing something like BM25 with a semantic ontology enhanced query (naive example: search for truck hits on Ford F-150, even if truck never appears in the doc) then vector based reranking. In testing, we always get the best result in the top 3.

▲dmezzetti 16 hours ago

Excellent article on BM25!

Author of txtai [1] here. txtai implements a performant BM25 index in Python [2] via the arrays package and storing the term frequency vectors in SQLite.

With txtai, the hybrid index approach [3] supports both convex combination when BM25 scores are normalized and reciprocal rank fusion (RRF) when they aren't [4].

[1] https://github.com/neuml/txtai

[2] https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...

[3] https://neuml.hashnode.dev/benefits-of-hybrid-search

[4] https://github.com/neuml/txtai/blob/master/src/python/txtai/...

▲treprinum 16 hours ago

text-embedding-3-large + SPLADE + RRF

▲jankovicsandras 1 day ago

Shameless plug:

https://github.com/jankovicsandras/plpgsql_bm25

https://github.com/jankovicsandras/bm25opt

▲softwaredoug 15 hours ago

If we're shameless plugging passion projects, SearchArray is a pandas extension for fulltext (BM25) search for dorking around with things in google colab

https://github.com/softwaredoug/searcharray

I'll also plug Xing Han Lu's BM25S which is very popular with similar goals:

https://github.com/xhluca/bm25s

▲mark_l_watson 19 hours ago

Thanks, yesterday I was thinking of adding BM25 to a little side project, so a well timed plug!

Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch but that seems too heavy weight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2 since SQLite uses BM25. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.

▲rogerbinns 6 hours ago

My project APSW may have exactly what you need. It wraps SQLite proving a Python API, and that includes the FTS5 full text search functionality. https://rogerbinns.github.io/apsw/textsearch.html

You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.

You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.

There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html

▲radiator 6 hours ago

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?

▲jll29 22 hours ago

Nice write-up.

A few more details/background that are harder to find: "BM25" stands for "Best Matching 25", "best matching" becaue it is a formula for ranking and term weighting (the matching refers to the term in the query versus the document), and the number 25 simply indicates a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).

It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked at the annual US NIST TREC (Text Retrieval Conference) for a number of years, the international "World Champtionship" of search engine methods (although the event is not about winning, but about compariing notes and learning from each other, a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).

Besides the "bag of words" Vector Space Model (sparse vectors of terms), the Probabilistic Modles (that BM25 belongs to), there are suprising and still growing number of other theoretical frameworks how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling, "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still publish occasionaly entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about Large Language Models that are on vogue now (that's covered under the "Neural Retrieval" umbrella), and that "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.

If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, and open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).

BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new nethods against), and a more recent variant, BM24F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).

The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)

▲CSSer 12 hours ago

Coincidentally, US NIST TREC is happening right now! It started on the 18th and will conclude on the 22nd.

Link for more details: https://trec.nist.gov/

▲marcyb5st 21 hours ago

Thanks for sharing!

Do you have more information about BM24F? Googling (also Google scholar) didn't yield anything related. Thanks in advance!

▲bradleyjkemp 21 hours ago

A typo I think, should be BM25F. From Wikipedia:

> BM25F (or the BM25 model with Extension to Multiple Weighted Fields) is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) https://en.wikipedia.org/wiki/Okapi_BM25

Some papers are linked in the references

▲marcyb5st 18 hours ago

Thanks, really appreciate it!

▲MPSimmons 14 hours ago

Does anyone know if the average document length mentioned in the document length normalization is median? It seems like it would need to be to properly deweight excessively long documents, otherwise the excessively long documents would unfairly weight the average, right?

▲softwaredoug 13 hours ago

It’s the mean. At least in Lucene. Using median would be an interesting experiment.

Do you know of a search dataset with very large document length differences? MSMarco for example is pretty consistent in length.

▲sidcool 21 hours ago

Good article. I am genuinely interested to learn about how to think of problems in such a mathematical form. And how to test it. Any resources?

▲ 16 hours ago

▲tselvaraj 15 hours ago

Hybrid search solves the long-standing challenge of relevance with search results. We can use ranking fusion between keyword and vector to create a hybrid search that works in most scenarios.

▲RA_Fisher 19 hours ago

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

▲mrbungie 19 hours ago

Are those the same market pressures that made Google discard or repurpose a lot of working old search tech for new shiny ML-based search tech? The same tech that makes you add "+reddit" in every search so you can evade the adversarial SEO war?

PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.

▲RA_Fisher 17 hours ago

Google’s come a long way since PageRank + terms. Ancient doesn’t mean bad, but usually it means outdated and that’s the case here. Search algos are subsumed by learning models, our species can do better now.

▲mbreese 17 hours ago

So, I’m not entirely sure if I follow you here… How would one use a language model to find a document out of a corpus of existing documents? As opposed to finding an answer to a question, trained on documents, which I can see. I mean answering a query like “find the report containing X”?

I see search as encompassing at least two separate, but related, domains: information gathering/seeking (answering a question) and information retrieval (find the best matching document). I’m curious how LLMs can help with the later.

▲ordersofmag 14 hours ago

That's the 'vector search' people are talking about in this discussion. Use the LLM to generate an embedding vector that represents the 'meaning' of your query. Do the same for all the documents (or better with chunks of all the documents). Find the document vector that's closest to your query vector and you have a document that has a 'meaning' similar to your query. Obviously that's just a starting point. And lots of folks are doing hybrid where they combine bm25 search with some sort of vector search (e.g. run them in parallel and combine results, or do a bm25 and then use vector search to rerank the top results).

▲netdur 19 hours ago

While BM25 did emerge from earlier work in the 1970s and 1980s (specifically building on the probabilistic ranking principle), I'm curious about your perspective on a few things:

What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.

While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?

▲RA_Fisher 17 hours ago

BM25 can be used as a starting point for a statistical learning model and more readily built on. A key advantage is that one gains a systematic way to reduce edge cases, instead of handling a couple, bc they’re so large as to be noticeable.

▲simplecto 19 hours ago

Those are some really spicy opinions. It would seem that many search experts might not agree.

David Tippet (formerly opensearch and now at Github)

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

https://www.youtube.com/watch?v=ENFW1uHsrLM

▲RA_Fisher 17 hours ago

I’m sure Search experts would disagree, because it’d be their technology they’d be admitting is inferior to another. BM25 is the workhorse, no doubt— but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).

▲dtaivpp 10 hours ago

I have been summoned. Hey it's David from the podcast. As someone who builds search for users every day and shaped the user experience for vector search at OpenSearch I assure you no one is afraid of their technology becoming inferior.

There are two components of search that are really important to understand why BM25 (will likely) not go away for a long time. The first is precision and the second is recall. Precision is the measure of how many relevant results were returned in light of all the results returned. A completely precise search would return only the relevant results and no irrelevant results.

Recall on the other hand measures how many of all the relevant results were returned. For example, if our search only returns 5 results but we know that there were 10 relevant search results that should have been returned we would say the recall is 50%.

These two components are always at odds with each other. Vector search excels at increasing recall. It is able to find documents that are semantically similar. The problem with this is semantically similar documents might not actually be what the user is looking for. This is because vectors are only a representation of user intent.

Heres an example: A user looks up "AWS Config". Vector search would read this and may rate it as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"]. In this case the user was looking for a file called, "AWS.config". Vector search is inherently imprecise. It is getting better but it's not replacing BM25 as a scoring mechanism any time soon.

You don't have to believe me though. Weaviate, Vespa, Qdrant all support BM25 search for a reason. Here is an in depth blog that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/

As an aside, vector search is also much more expensive than BM25. It's very hard to scale and get precise results.

▲RA_Fisher 10 hours ago

Hi David. Nice to meet you. Yes, precision and recall are always in tension. However, both can be made simultaneously better with a more informed model. Using your example, this would be a model that encodes the concept of files in the context of a user demand surrounding AWS.

▲iosjunkie 9 hours ago

"more informed model"

Can you be specific on what you recommend instead of BM25?

▲RA_Fisher 4 hours ago

Sure, this chat describes what I have in mind, https://chatgpt.com/share/673e9290-a044-8005-995b-166efe653e...

▲softwaredoug 15 hours ago

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so can claim to be "AI Engineers" :)

▲RA_Fisher 13 hours ago

I don’t think it’s realistic to think that software engineers can pick up advanced statistical modeling on the job, unless they’re pairing with statisticians. There’s just too much background involved.

▲softwaredoug 11 hours ago

The "search practitioners" I'm referring to are pretty uniformly ML Engineers . They also work on feeds, recommendations, and adjacent Information Retrieval spaces. Both to generate L0 retrieval candidates and to do higher layers of reranking with learning to rank and other systems to whatever the system's goal is...

You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.

▲RA_Fisher 9 hours ago

For sure. There are ML folks with statistical learning backgrounds, but it tends to be relatively rare. Physics and CS are more common. They tend to view things like you mention, more procedural eg- learning to rank, minimizing distances, less statistical modeling. Humility around statistics is good, but statistical knowledge is still what's required to really level up these systems (I've built them as well).

▲binarymax 12 hours ago

Your overall condescending attitude in this thread is really disgusting.

▲RA_Fisher 12 hours ago

Statisticians are famously disliked, especially by engineers (there are open-minded folks, of course! maybe they’d taken some econometrics or statistics, are exceptionally humble, etc). There are some interesting motives and incentives around that. Sometimes I think in part it’s because many people would prefer their existing beliefs be upheld as opposed to challenged, even if they’re not well-supported (and likely to lead to bad decisions and outcomes). Sticking with outdated technology is one example.

▲simplecto 16 hours ago

It seems that the current mode (eg fashion) is a hybrid approach, with vector results on one side, BM25 on the other, and then a re-reank algo to smooth things out.

I'm out of my depth here but genuinely interested and curious to see over the horizon.

▲RA_Fisher 12 hours ago

Best is to use one statistical model and encode the underlying aspects of the context that relate to goal outcomes.

▲authorfly 16 hours ago

Out of interest how come you use the word "mode" here?

▲simplecto 16 hours ago

because the space moves fast, and from my learning this is the current thing. Like fashion -- it changes from season to season

▲dumb1224 18 hours ago

Agreed. In the 2000s it was all about BM25 in the NLP community. I hardly see any paper that did not mention it in my opinion.

▲RA_Fisher 17 hours ago

For sure, it’s very popular, just not the best anymore (and actually far from it).

▲authorfly 16 hours ago

And dependency chaining. But yes, lots of BM25.

The 2000s and even 2010s was a wonderful and fairly theoretical time for linguistics and NLP. A time when NLP seemed to harbor real anonymized general information to make the right decisions with, without impinging on privacy.

Oh to go back.

▲softwaredoug 15 hours ago

I think there are also incentives to "sell new things". That's always been the case in search which has had a bazillion trends and "AI related things" as long as I've worked in it. We have massively VC funded vector search companies with armies of tech evangelists pushing a specific point of view right now.

Meanwhile, the amount of manual curation, basic, boring hand-curated taxonomies that actually drive things like "semantic search" at places like Google are simply staggering. Just nobody talks about them much at conferences because they're not very sexy.