Try chemfp. It doesn’t support your format, but it’s easy to convert from your format to the hex-encoded one it does use.
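As a rough sketch of what that conversion can look like: the code below assumes your input is a string of "0" and "1" characters (that input format is only a guess) and writes one line per fingerprint as the hex string, a tab, and the identifier. The helper names are made up for illustration, and the bit order (first bit is the least significant bit of the first byte) is an assumption you should check against the chemfp documentation.

```python
def bitstring_to_hex(bits):
    """Pack a string of '0'/'1' characters into a hex-encoded fingerprint.

    Assumption: bit 0 is the least significant bit of the first byte.
    """
    # Pad with zeros so the length is a multiple of 8 bits.
    padded = bits + "0" * (-len(bits) % 8)
    packed = bytearray()
    for start in range(0, len(padded), 8):
        chunk = padded[start:start + 8]
        # Position j inside the chunk contributes the value 1 << j.
        packed.append(sum(1 << j for j, c in enumerate(chunk) if c == "1"))
    return packed.hex()


def fps_line(bits, identifier):
    """Format one record as: hex fingerprint, tab, identifier, newline."""
    return bitstring_to_hex(bits) + "\t" + identifier + "\n"


if __name__ == "__main__":
    print(fps_line("1000000011111111", "mol1"), end="")
```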
It has Baldi optimization (well, one of them). The popcount code is some of the fastest around; I’m working on a version right now which uses a hardware popcount if your processor supports it, but even the non-chip-specific code is quite fast. (It uses the CPU, not the GPU.)
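To make the popcount remark concrete, here is a minimal pure-Python sketch of how a Tanimoto score reduces to bit counts. It is not chemfp's code (which is far faster); the function names are hypothetical and the fingerprints are assumed to be equal-length byte strings.

```python
def popcount(fp):
    """Count the on-bits in a byte string."""
    return bin(int.from_bytes(fp, "little")).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = popcount(bytes(x & y for x, y in zip(a, b)))
    either = popcount(a) + popcount(b) - both
    # Two all-zero fingerprints have no defined score; report 0.0 here.
    return both / either if either else 0.0


if __name__ == "__main__":
    print(tanimoto(bytes.fromhex("0f"), bytes.fromhex("03")))
```

Nearly all of the time goes into the popcounts, which is why a hardware popcount instruction makes a noticeable difference.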
Do you really want “all-pairwise similarities”, or all similarities over a threshold? The first doesn’t take advantage of the Baldi optimization, and will generate a lot of data. (How do you want that data?)
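The reason a threshold helps: two fingerprints with popcounts a and b can score at most min(a, b) / max(a, b), so any target whose popcount is too far from the query's can be skipped without comparing bits. The sketch below illustrates that popcount bound only. It is not chemfp's implementation, the names are hypothetical, and fingerprints are stored as Python integers to keep it short.

```python
def popcount(fp):
    """Count the on-bits in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def count_matches(query, targets, threshold):
    """Count targets with Tanimoto >= threshold.

    Returns (matches, compared), where `compared` is how many targets
    survived the popcount bound and needed a real bit comparison.
    """
    q = popcount(query)
    matches = 0
    compared = 0
    for target in targets:
        t = popcount(target)
        largest = max(q, t)
        # Best case: the smaller fingerprint's bits are a subset of the
        # larger one's, giving min/max. Skip if even that is too low.
        # (Two empty fingerprints are skipped as well.)
        if largest == 0 or min(q, t) / largest < threshold:
            continue
        compared += 1
        both = popcount(query & target)
        if both / (q + t - both) >= threshold:
            matches += 1
    return matches, compared


if __name__ == "__main__":
    targets = [0b1111, 0b0111, 0b0001, 0b11111111]
    print(count_matches(0b1111, targets, 0.7))
```

With a threshold of 0 nothing is ever skipped, which is the all-pairwise case. Precomputing the target popcounts and sorting by them turns the skip test into a range lookup.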
The timings are highly dependent on what you want to do. Try it out and see. My test case simply counts, for each query, the number of targets with a similarity of at least 0.9. The query and target sets each contain 110891 RDKit fingerprints of 2048 bits. The search takes 7 seconds (single-threaded) on my desktop, which has hardware POPCNT support. Without the hardware popcount it takes 13 seconds.
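For a sense of scale, here is the arithmetic behind that test case. The 4 bytes per stored score is my assumption, and the rates are nominal: they count every query/target pair the search covers, including pairs a threshold search may skip without comparing.

```python
# Figures taken from the test case described above.
n_queries = n_targets = 110891
seconds_with_popcnt = 7.0
seconds_without_popcnt = 13.0

# Every query is checked against every target.
pairs = n_queries * n_targets

# Assumption: an all-pairwise run keeps each score as a 4-byte float.
bytes_per_score = 4
all_pairs_bytes = pairs * bytes_per_score

if __name__ == "__main__":
    print("pairs covered:", pairs)
    print("all-pairwise output, GB:", all_pairs_bytes / 1e9)
    print("nominal pairs/s with POPCNT:", pairs / seconds_with_popcnt)
    print("nominal pairs/s without:", pairs / seconds_without_popcnt)
```

That output size is what makes "all-pairwise" expensive: storing every score is a much bigger job than counting or keeping only the hits above a threshold.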
Also, I’m looking for project funding. Some of the future directions are parallelization, OpenMP, faster support for when the targets and queries are identical, new methods for large sparse fingerprints, and more. Let me know if you’re interested in contributing funding to the effort.