Read up to chapter two (table RAG) over breakfast.
Question: for tables in a relational DB where information is split across tables, I assume I'd want to join them so that the record of a transaction contains actual information instead of a bunch of keys. This will naturally inflate the serialized docs and result in a RAG DB that takes up a lot of storage space.
I'm thinking that once the embedding vectors are created, I'd want to throw away the serialized docs and save the embeddings into the DB in a table whose records contain the embedding plus keys into the tables the data was pulled from to create the embedded doc.
What do you think? Should this go in the book?
Sorry for all the typos. Didn't realize I can't edit my comment
Yeah, your instinct is right. That's basically what most production setups do.
Two things worth separating though:
First, denormalizing at indexing time. Yes, definitely. The LLM can't follow foreign
keys, so each chunk needs to stand on its own. Join the tables, turn the row into natural
language, embed that.
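A rough sketch of that join-then-serialize step, using Python's built-in sqlite3. The `customers`, `products`, and `orders` tables and the sentence template are made up for illustration; your schema and wording will differ.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema (hypothetical tables, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE products  (id INTEGER PRIMARY KEY, title TEXT, price REAL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            product_id  INTEGER REFERENCES products(id),
            quantity INTEGER,
            ordered_on TEXT
        );
        INSERT INTO customers VALUES (1, 'Ada Park', 'Lisbon');
        INSERT INTO products  VALUES (10, 'USB-C dock', 89.5);
        INSERT INTO orders    VALUES (100, 1, 10, 2, '2024-03-05');
    """)
    return conn


# Resolve the foreign keys with a join so the chunk carries values, not ids.
JOIN_SQL = """
SELECT o.id AS order_id, o.quantity, o.ordered_on,
       c.name AS customer, c.city,
       p.title AS product, p.price
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN products  p ON p.id = o.product_id
WHERE o.id = ?
"""


def serialize_order(conn: sqlite3.Connection, order_id: int) -> str:
    """Turn one joined order row into a standalone sentence ready for embedding."""
    row = conn.execute(JOIN_SQL, (order_id,)).fetchone()
    if row is None:
        raise KeyError(order_id)
    return (
        f"Order {row['order_id']}: {row['customer']} from {row['city']} "
        f"bought {row['quantity']} x {row['product']} "
        f"at {row['price']:.2f} each on {row['ordered_on']}."
    )
```

Calling `serialize_order(build_demo_db(), 100)` gives one self-contained sentence per transaction, which is what you'd pass to the embedding model.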
Second, throwing away the serialized doc. Also reasonable. Store the embedding with the
source table name and primary key, and just rehydrate from the relational DB when you
actually need the full record at query time. The nice side effect is that your relational
DB stays the single source of truth, so when a row gets updated you don't have stale
text sitting in your vector store going out of sync.
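Here's a minimal sketch of the pointer-plus-rehydration idea, again with sqlite3. Everything named here is an assumption for illustration: the `embedding_index` table, the `orders` table, and especially `toy_embed`, which is a hash-based stand-in for a real embedding model. A real setup would likely use a vector store or extension instead of the brute-force scan.

```python
import hashlib
import math
import sqlite3
import struct

DIM = 16  # toy dimension; real models use hundreds or more


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model (assumption): hash words into buckets."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pack(vec: list[float]) -> bytes:
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT,
                             product TEXT, quantity INTEGER);
        -- Only the vector and a pointer back to the source row; no text column.
        CREATE TABLE embedding_index (source_table TEXT, source_pk INTEGER,
                                      embedding BLOB);
        INSERT INTO orders VALUES (100, 'Ada Park', 'USB-C dock', 2);
        INSERT INTO orders VALUES (101, 'Ben Ito', 'Laptop stand', 1);
    """)
    return conn


# Whitelist of rehydration queries, so table names never come from user input.
REHYDRATE_SQL = {
    "orders": "SELECT id, customer, product, quantity FROM orders WHERE id = ?",
}


def serialize(row: sqlite3.Row) -> str:
    return (f"Order {row['id']}: {row['customer']} bought "
            f"{row['quantity']} x {row['product']}.")


def index_row(conn: sqlite3.Connection, table: str, pk: int) -> None:
    """Serialize, embed, store vector + pointer; the text itself is discarded."""
    text = serialize(conn.execute(REHYDRATE_SQL[table], (pk,)).fetchone())
    conn.execute("INSERT INTO embedding_index VALUES (?, ?, ?)",
                 (table, pk, pack(toy_embed(text))))


def search(conn: sqlite3.Connection, query: str, k: int = 1) -> list[str]:
    """Brute-force cosine search, then rebuild the text from the live rows."""
    q = toy_embed(query)
    scored = []
    for r in conn.execute(
            "SELECT source_table, source_pk, embedding FROM embedding_index"):
        score = sum(a * b for a, b in zip(q, unpack(r["embedding"])))
        scored.append((score, r["source_table"], r["source_pk"]))
    scored.sort(reverse=True)
    return [serialize(conn.execute(REHYDRATE_SQL[t], (pk,)).fetchone())
            for _, t, pk in scored[:k]]
```

Because `search` rebuilds the text from the source row at query time, an `UPDATE` on `orders` shows up in the next result with no extra sync step. The embedding itself can still go stale after an update, so you'd want to re-embed changed rows at some point.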
The one case where I'd keep the serialized text next to the vector is if rebuilding it is
expensive (like a big multi-table join) or if you're latency sensitive. But then treat
it as a cache, not the truth, and make sure you have a way to invalidate it when the
underlying data changes.
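One possible way to do the invalidation, sketched with a SQLite trigger. The `doc_cache` table and trigger name are hypothetical, and this only covers updates to a single source table; a multi-table join would need a trigger (or some other change signal) for every table that feeds the text.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT,
                             product TEXT, quantity INTEGER);
        -- Cached serialized text, keyed by the same pointer as the embedding.
        CREATE TABLE doc_cache (source_table TEXT, source_pk INTEGER, text TEXT,
                                PRIMARY KEY (source_table, source_pk));
        -- Drop the cached text whenever the source row changes.
        CREATE TRIGGER orders_invalidate_cache AFTER UPDATE ON orders
        BEGIN
            DELETE FROM doc_cache
            WHERE source_table = 'orders' AND source_pk = OLD.id;
        END;
        INSERT INTO orders VALUES (100, 'Ada Park', 'USB-C dock', 2);
    """)
    return conn


def get_order_text(conn: sqlite3.Connection, pk: int) -> tuple[str, bool]:
    """Return (text, served_from_cache); rebuild and re-cache on a miss."""
    hit = conn.execute(
        "SELECT text FROM doc_cache WHERE source_table = 'orders' AND source_pk = ?",
        (pk,),
    ).fetchone()
    if hit is not None:
        return hit["text"], True
    row = conn.execute(
        "SELECT id, customer, product, quantity FROM orders WHERE id = ?", (pk,)
    ).fetchone()
    if row is None:
        raise KeyError(pk)
    text = (f"Order {row['id']}: {row['customer']} bought "
            f"{row['quantity']} x {row['product']}.")
    conn.execute("INSERT INTO doc_cache VALUES ('orders', ?, ?)", (pk, text))
    return text, False
```

The first read is a miss and fills the cache, repeat reads are hits, and any `UPDATE` on the order deletes the cached text so the next read rebuilds it from the current row.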
So basically: pointers plus lazy rehydration by default, and only cache the text if you
actually measure a problem.
The good thing about Kindle is you can also buy in the US store and it'll be shipped tariff-free to you wherever you are.
Yep. You can also find it in other stores worldwide, not only the US :)
Can you make some ideas from the book?
what do you mean?
Try to buy😊
how can I buy...
I'm not able to buy the Kindle edition. It's showing "not available in your country".