> Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created

With the exception of Pinecone, vector databases such as Chroma, Qdrant, Weaviate, Elastic, Mongo, and many others store the chunk/document alongside the embedding.

This is intentional misinformation.




Post co-author here. The point is a little nuanced, so let me explain:

You are correct that you can store embeddings and source data together in many vector DBs; we actually point this out in the post. The main point is that they are not linked but merely stored alongside each other. If one changes, the other does not automatically change, so the relationship between the two goes stale.

The idea behind Pgai Vectorizer is that it actually links embeddings to the underlying source data, so that changes in the source data are automatically reflected in the embeddings. This is a better abstraction, and it removes the burden on the engineer of keeping embeddings in sync as the data changes.
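
To make that concrete, here is the kind of bookkeeping the vectorizer is meant to take off your plate (an illustrative Python sketch of a manual sync job against hypothetical documents/embeddings tables, not the actual pgai interface):

    # Illustrative only: a manual job that re-embeds rows whose source text
    # changed after they were last embedded. Assumes hypothetical tables
    # documents(id, content, updated_at) and
    # embeddings(doc_id, chunk, embedding, embedded_at).
    import psycopg

    def resync_stale_embeddings(conn, embed, chunk):
        with conn.cursor() as cur:
            cur.execute("""
                SELECT d.id, d.content
                FROM documents d
                WHERE d.updated_at > COALESCE(
                    (SELECT max(e.embedded_at) FROM embeddings e WHERE e.doc_id = d.id),
                    '-infinity')
            """)
            for doc_id, content in cur.fetchall():
                cur.execute("DELETE FROM embeddings WHERE doc_id = %s", (doc_id,))
                for piece in chunk(content):
                    cur.execute(
                        "INSERT INTO embeddings (doc_id, chunk, embedding, embedded_at) "
                        "VALUES (%s, %s, %s, now())",
                        (doc_id, piece, embed(piece)),
                    )
        conn.commit()

With the vectorizer, that bookkeeping (plus the scheduling, retries, and failure handling around it) is what gets handled for you.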


I know that in Chroma this is supported out of the box with 0 lines of code. I'm pretty sure it's supported everywhere else in no more than 3 lines of code.
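
For example (a rough sketch from memory, so double-check against the Chroma docs):

    # Chroma stores the document text alongside the embedding it computes.
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("docs")

    # The text is stored with the embedding; Chroma embeds it with the
    # collection's embedding function.
    collection.add(ids=["doc-1"], documents=["original text of the document"])

    # Updating the document text re-embeds it automatically.
    collection.update(ids=["doc-1"], documents=["revised text of the document"])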


This is also the case with Weaviate (as you assumed). If you update the value of any previously vectorized property, Weaviate generates new vectors for you automatically.


As far as I can tell, Chroma can only store chunks, not the original documents. This is from your docs: `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.

In addition, it seems that embedding happens at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system: all the complexity we describe in our blog post.
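
Concretely, users end up owning glue like this (an illustrative sketch, not tied to any particular client; collection.add stands in for whatever insert call embeds at ingest time):

    # Sketch of the retry + queue glue needed when embedding happens
    # synchronously at insert time and the embedding endpoint can be down.
    import time, queue

    retry_queue = queue.Queue()  # failed inserts, to be retried by a background worker

    def insert_with_retry(collection, doc_id, text, max_attempts=5):
        for attempt in range(max_attempts):
            try:
                collection.add(ids=[doc_id], documents=[text])
                return True
            except Exception:             # e.g. embedding endpoint unavailable
                time.sleep(2 ** attempt)  # exponential backoff
        retry_queue.put((doc_id, text))   # hand off for later retry
        return False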

Obviously, I am not an expert in Chroma. So apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.


Chroma certainly doesn't have the most advanced API in this area, but you can for sure store chunks or documents; it's up to you. If your document is too large to generate embeddings for in a single forward pass, then yes, you do need to chunk in that scenario.

Often, though, even if the document does fit, you choose to chunk anyway or further transform the data with abstractive/extractive summarization techniques to improve your search dynamics. This is why I'm not sure the complexity noted in the article is relevant in anything beyond a "naive RAG" stack. How it's stored or linked is an issue to some degree, but the bigger, more complex smell is in what happens before you even get to the point of inserting the data.

For more production-grade RAG, blindly inserting embeddings wholesale for full documents is rarely going to get you great results (this varies a lot with document size and domain). As a result, you're almost always going to be doing ahead-of-time chunking (or summarization/NER/etc.), not because you have to due to document size, but because your search performance demands it. Frequently this involves more than one embedding model, to capture different semantics or support different tasks, not to mention reranking after the initial sweep.
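
To illustrate, the ahead-of-time side usually ends up looking something like this (a sketch; chunker, summarize, the embed_* functions, rerank, and store are all stand-ins for whatever you actually use):

    # Sketch of an ahead-of-time indexing pipeline: chunk, summarize, and
    # embed with more than one model for different retrieval tasks.
    def index_document(doc_id, text, chunker, summarize,
                       embed_for_search, embed_for_qa, store):
        for i, chunk in enumerate(chunker(text)):
            summary = summarize(chunk)                  # abstractive/extractive pass
            store.upsert(
                id=f"{doc_id}:{i}",
                text=chunk,
                summary=summary,
                vectors={
                    "search": embed_for_search(chunk),  # general semantic search
                    "qa": embed_for_qa(summary),        # task-specific embedding
                },
            )

    # Query side: initial vector sweep, then rerank the candidates.
    def retrieve(query, store, embed_for_search, rerank, k=50, top_n=5):
        candidates = store.query(vector=embed_for_search(query),
                                 space="search", limit=k)
        return rerank(query, candidates)[:top_n]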

That's the complexity that I think is worth tackling in a paid product offering, but the current state of the module described in the article isn't really competitive with the rest of the field in that respect IMHO.


(Post co-author) We absolutely agree that chunking is critical for good RAG. What I think you missed in our post is that the vectorizer allows you to configure a chunking strategy of your choice. So you store the full doc, and then the system will chunk and embed it for you. We don't blindly embed the full document.


I didn't miss that detail; I just don't think chunking alone is where the complexity lies, and the pgai feature set isn't really differentiated at all from other offerings in that context. My commentary about full documents was responding directly to your comment here in this thread more than to the article (you claimed Chroma can only insert chunks, which isn't accurate, and I expanded from there).


This is true only in the trivial case where the entire document fits in a single chunk, correct?

That seems like a meaningful distinction.


Yes, that is correct, but my position (which has perhaps been poorly articulated) is that in the non-trivial instances it is a distinction without a difference in the greater context of the RAG stack and related pipelines.

Just allowing a chunking function to be defined and called at insertion time doesn't really alleviate the major pain points inherent to the process. It's a minor convenience, but in fact, as others have pointed out elsewhere in this thread, it's a convenience you can afford yourself in a handful of lines of code that you only ever have to write once.
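
For the sake of argument, here's the "handful of lines" version, assuming a Chroma-style client (write it once and you're done):

    # One-time wrapper: chunk at insert time with a chunker of your choosing.
    def add_document(collection, doc_id, text, chunker):
        chunks = chunker(text)
        collection.add(
            ids=[f"{doc_id}:{i}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"doc_id": doc_id, "chunk": i} for i in range(len(chunks))],
        )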



