They're an _amazing_ data structure that, at a fixed size, tracks potential set membership. That means unlike normal B-tree indexes, they don't grow with the number of unique items in the dataset.
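To make that concrete, here's a minimal toy sketch in Go (my own illustration, not BloomSearch's actual implementation): the filter never grows, `Add` sets k bit positions per item, and `Test` can return false positives but never false negatives.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// A toy fixed-size bloom filter: m bits, k probe positions per item.
type bloomFilter struct {
	bits []bool
	k    int
}

func newBloomFilter(m, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, m), k: k}
}

// positions derives k bit positions from a single 64-bit hash
// using the standard double-hashing trick (h1 + i*h2 mod m).
func (b *bloomFilter) positions(item string) []int {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	out := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		out[i] = int((h1 + uint32(i)*h2) % uint32(len(b.bits)))
	}
	return out
}

func (b *bloomFilter) Add(item string) {
	for _, p := range b.positions(item) {
		b.bits[p] = true
	}
}

// Test returns false only when the item is definitely absent;
// true means "possibly present" (false positives are allowed).
func (b *bloomFilter) Test(item string) bool {
	for _, p := range b.positions(item) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloomFilter(1024, 4)
	f.Add("error")
	fmt.Println(f.Test("error"))   // true
	fmt.Println(f.Test("warning")) // false (almost certainly)
}
```

That asymmetry is exactly what makes them useful as an index: a `false` from `Test` lets you skip data without ever reading it.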
This makes them great for "needle in a haystack" search (logs, documents), as implementations like VictoriaMetrics and Bing's BitFunnel show. I've used them in the past, but they've never been center-stage in my projects.
I wanted high-cardinality keyword search for ANOTHER project... and, well, down the yak-shaving rabbit hole we go!
BloomSearch brings this into an extensible Go package:
- Very memory-efficient via bloom filters and streaming row scans
- DataStore and MetaStore interfaces for any backend (can be same or separate; see the sketch after this list)
- Hierarchical pruning via partitions, minmax indexes, and of course bloom filters
- Search by field, token, or field:token with complex combinators
- Disaggregated storage and compute for unbounded ingest and query throughput
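Here's a hedged sketch of how those pieces could fit together; the interface, type, and field names are my own illustration, not BloomSearch's actual API. The MetaStore holds small per-block metadata (partition, minmax ranges, bloom filters), the DataStore holds the row blocks, and a query only streams blocks that survive all three pruning levels:

```go
package bloomsearch

import "context"

// Hypothetical sketch only — names and shapes are assumptions.
type BlockMeta struct {
	BlockID   string
	Partition string
	MinMax    map[string][2]int64 // per-field min/max, e.g. timestamps
	Bloom     BloomFilter         // built over field, token, and field:token entries
}

type BloomFilter interface {
	Test(key string) bool // false => definitely absent
}

type MetaStore interface {
	PutMeta(ctx context.Context, meta BlockMeta) error
	ListMeta(ctx context.Context, partition string) ([]BlockMeta, error)
}

type DataStore interface {
	PutBlock(ctx context.Context, blockID string, rows []byte) error
	GetBlock(ctx context.Context, blockID string) ([]byte, error)
}

// prune applies the hierarchy: partition (via ListMeta), then
// minmax ranges, then bloom filters. Only surviving block IDs
// get fetched from the DataStore and row-scanned.
func prune(metas []BlockMeta, field, token string, tsMin, tsMax int64) []string {
	var keep []string
	for _, m := range metas {
		if mm, ok := m.MinMax["timestamp"]; ok {
			if mm[1] < tsMin || mm[0] > tsMax {
				continue // minmax says the block can't overlap the query range
			}
		}
		if !m.Bloom.Test(field + ":" + token) {
			continue // bloom says the field:token pair is definitely absent
		}
		keep = append(keep, m.BlockID)
	}
	return keep
}
```

Because both stores are plain interfaces, one backend (say, object storage) can serve both roles, or they can be split, and query compute scales independently of storage.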
And of course, you know I had to make a custom file format ^-^ (FILE_FORMAT.MD)
BloomSearch is optimized for massive concurrency, arbitrary cardinality and dataset size, and super low memory usage. There's still a lot of size and performance optimization left on the table, but I'm already super pleased with it. With distributed query processing I'm targeting >100B rows/s over large datasets.
I'm also excited to replace our big logging bill with ~$0.003/GB log storage, infinite retention, and guilt-free querying :P