I tried vectorizing a CSV with LangChain
This is preparation for using the data in RAG.
For vectorization I used Faiss | 🦜️🔗 LangChain.
FAISS is short for Facebook AI Similarity Search.
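Before running the code below, a sample.csv is needed. The actual file used in this post is not shown, so the column names and rows in the following sketch are made up, just to illustrate the kind of file CSVLoader expects (each row becomes one Document):

import csv

# Hypothetical sample.csv: CSVLoader turns each row into one Document whose
# page_content looks like "title: ...\nbody: ..." and whose metadata records
# the source file and row number.
rows = [
    {"title": "IoT sensor basics", "body": "How to send temperature sensor data to the cloud"},
    {"title": "Introduction to RAG", "body": "How retrieval augmented generation works"},
]
with open("./sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "body"])
    writer.writeheader()
    writer.writerows(rows)

With a file like that in place, the vectorization code looks like this: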
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

FAISS_DB_PATH = "./faiss_db"

# Load the CSV; each row becomes one Document
loader = CSVLoader(file_path="./sample.csv")
docs = loader.load()

# Split the documents into chunks before embedding
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)

# Embed with OpenAI, build the FAISS index, and save it to disk
embeddings = OpenAIEmbeddings()
vector = FAISS.from_documents(documents, embeddings)
vector.save_local(FAISS_DB_PATH)

# Try searching for "IoT" through a retriever
retriever = vector.as_retriever()
query = "IoT"
context_docs = retriever.get_relevant_documents(query)  # retriever.invoke(query) on newer versions
print(f"len={len(context_docs)}")
first = context_docs[0]
print(f"metadata={first.metadata}")
print(first.page_content)
This saves the index data to the ./faiss_db directory.
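If you want to check what was written, save_local should have produced the FAISS index plus a pickled docstore, typically index.faiss and index.pkl (file names assumed from the default index_name):

import os

# List the files written by save_local; expect something like ['index.faiss', 'index.pkl']
print(os.listdir(FAISS_DB_PATH))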
To load the saved index back later:
# Reload the index; pass the same embeddings used when the index was built
vector = FAISS.load_local(
    FAISS_DB_PATH, embeddings, allow_dangerous_deserialization=True
)
This restores the saved vector store.
Without allow_dangerous_deserialization=True, loading fails with:
ValueError: The de-serialization relies loading a pickle file. Pickle files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine.You will need to set `allow_dangerous_deserialization` to `True` to enable deserialization. If you do this, make sure that you trust the source of the data. For example, if you are loading a file that you created, and no that no one else has modified the file, then this is safe to do. Do not set this to `True` if you are loading a file from an untrusted source (e.g., some random site on the internet.).
Make sure you understand the risk described in this message before enabling the flag.
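To actually use this for RAG, the loaded retriever can be wired into a small chain. The following is only a sketch of the standard LangChain LCEL pattern; the prompt wording, the question, and the use of ChatOpenAI are my own assumptions, not part of the original setup:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved Documents into a single context string
    return "\n\n".join(d.page_content for d in docs)

retriever = vector.as_retriever()
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
    | StrOutputParser()
)
print(chain.invoke("Which rows are related to IoT?"))

Here the retriever comes from the vector store reloaded above, so the chain answers against the saved FAISS index.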
Getting RAG working really does take a fair amount of preparation.