
Clever!

A method that has worked well for me: divorced databases.

The first is a plaintext database that stores rows of id, data, and metadata; the second is a vector database that stores id and embedding. Whenever a new row is added, the first database makes a POST request to the second. The second database embeds the data and returns the ID of its row, and the first database stores the plaintext under that same ID.

For search, the second database is optimized for cosine similarity with an HNSW index. It returns the matching IDs to the first database, which fetches the plaintext to return to the user.
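A minimal in-memory sketch of the write and search flow described above. Everything here is an assumption for illustration: `toy_embed` stands in for a real embedding model, a brute-force scan stands in for the HNSW index, a direct method call stands in for the POST request, and the class names `VectorDB` and `PlaintextDB` are hypothetical.

```python
import math
from dataclasses import dataclass, field


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Placeholder embedding: character counts hashed into buckets, unit length."""
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class VectorDB:
    """Second database: stores only (id, embedding)."""
    rows: dict[int, list[float]] = field(default_factory=dict)
    next_id: int = 1

    def add(self, data: str) -> int:
        # Embed the data and hand back the ID of the new row.
        row_id = self.next_id
        self.next_id += 1
        self.rows[row_id] = toy_embed(data)
        return row_id

    def search(self, query: str, k: int = 3) -> list[int]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = toy_embed(query)
        ranked = sorted(
            self.rows.items(),
            key=lambda item: sum(a * b for a, b in zip(q, item[1])),
            reverse=True,
        )
        return [row_id for row_id, _ in ranked[:k]]


@dataclass
class PlaintextDB:
    """First database: stores (id, data, metadata) under the vector row's ID."""
    vectors: VectorDB
    rows: dict[int, dict] = field(default_factory=dict)

    def insert(self, data: str, metadata: dict) -> int:
        row_id = self.vectors.add(data)  # stands in for the POST request
        self.rows[row_id] = {"data": data, "metadata": metadata}
        return row_id

    def search(self, query: str, k: int = 3) -> list[str]:
        # The vector side returns IDs only; plaintext is looked up locally.
        return [self.rows[i]["data"] for i in self.vectors.search(query, k)]
```

Swapping the embedding model then only touches the vector side; the plaintext rows keep their IDs and contents.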

The advantages are that the plaintext data can be A/B tested across multiple embedding models without affecting the source, and each database can be provisioned for its specific task. It also reduces hosting costs and security overhead, because there only needs to be one central vector database alongside small provisioned plaintext databases.




It sounds like this is pretty similar to the approach that the post is advocating against, although I can see your reasoning behind it.


Post co-author here. This is actually something that we are considering implementing in future versions of pgai Vectorizer. You point the vectorizer at database A but tell it to create and store embeddings in database B. You can always do joins across the two databases with Postgres FDWs, and it would solve issues of load management if those are a concern. Neat idea and one on our radar!


The limitation with that is no hybrid search, which is often needed. “Show me only results for this user or tenant or category etc.”
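One way to read this limitation, as a rough sketch: if the vector store knows nothing about tenants, the filter can only be applied after it has already cut the results down to k, so a filtered query may come back short. The function names and sample data below are hypothetical, and `ranked_ids` stands in for whatever similarity order the vector store would return.

```python
def post_filter(ranked_ids, metadata_by_id, tenant, k):
    """Split setup: take the vector store's top k first, then filter by tenant."""
    top_k = ranked_ids[:k]
    return [i for i in top_k if metadata_by_id[i]["tenant"] == tenant]


def pre_filter(ranked_ids, metadata_by_id, tenant, k):
    """Single store with metadata: filter by tenant first, then take k."""
    matching = [i for i in ranked_ids if metadata_by_id[i]["tenant"] == tenant]
    return matching[:k]


# Hypothetical similarity ranking and tenant metadata.
ranked_ids = [1, 2, 3, 4, 5, 6]
metadata_by_id = {
    1: {"tenant": "a"},
    2: {"tenant": "a"},
    3: {"tenant": "b"},
    4: {"tenant": "a"},
    5: {"tenant": "b"},
    6: {"tenant": "b"},
}

short = post_filter(ranked_ids, metadata_by_id, "b", 3)  # only one of the top 3 is tenant "b"
full = pre_filter(ranked_ids, metadata_by_id, "b", 3)    # three tenant "b" rows
```

Over-fetching from the vector store (asking for more than k) can narrow the gap, but it does not guarantee k matches for a small tenant.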



