# (Abandoned: documents too small) Creating RAG from scratch


https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval.html

### PostgreSQL with pgvector (docker-compose file)
```
version: "3.6"

services:
  postgres-pgvector:
    image: ankane/pgvector:latest
    container_name: postgres-pgvector
    restart: always
    volumes:
      - postgres_pgvector_volume:/var/lib/postgresql/data
    ports:
      - 5432:5432
    environment:
      POSTGRES_USER: martin
      POSTGRES_PASSWORD: password123
volumes:
  postgres_pgvector_volume:
```

In [5]:
import psycopg2

db_name = "vector_db"
host = "localhost"
password = "password"  # must match POSTGRES_PASSWORD in the compose file
port = "5432"
user = "user"  # must match POSTGRES_USER in the compose file
# conn = psycopg2.connect(connection_string)
conn = psycopg2.connect(
    dbname='postgres',
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")

In [36]:
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="transcription",
    embed_dim=384,  # embedding dimension of BAAI/bge-small-en-v1.5 (not an OpenAI model)
)
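
A pgvector column has a fixed dimension, so `embed_dim` has to agree with the length of the vectors the embedding model produces. The following is a minimal sketch of a guard that could be run before inserting anything; `check_embedding_dim` is a hypothetical helper, not part of LlamaIndex, and the zero vectors stand in for real embeddings.

```python
def check_embedding_dim(embedding: list[float], expected_dim: int) -> None:
    """Raise early if a vector does not fit the table's declared dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, table expects {expected_dim}"
        )


# Stand-in vectors (assumption): 384 matches embed_dim above, 1536 does not.
check_embedding_dim([0.0] * 384, expected_dim=384)  # passes silently

try:
    check_embedding_dim([0.0] * 1536, expected_dim=384)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```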

### Use a Text Splitter to Split Documents

In [37]:
from llama_index.core.node_parser import SentenceSplitter

text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)
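
To make it concrete what a sentence-aware splitter does, here is a simplified sketch. It is not LlamaIndex's implementation: `split_into_chunks` is a hypothetical function, it measures size in characters rather than tokens, and it has no overlap between chunks.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = (
    "All right, everyone. Thanks for coming in. "
    "This is the missing semester of your CS education."
)
chunks = split_into_chunks(sample, chunk_size=45)
for chunk in chunks:
    print(repr(chunk))
```

With a small `chunk_size` the first two short sentences share a chunk and the long third sentence stands alone, which is the same trade-off `chunk_size=1024` controls at a larger scale.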

In [38]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="./txt_files/")
documents = reader.load_data()

In [39]:
documents

[Document(id_='1e1bcc2c-7b59-4755-ba9f-5906148167b0', embedding=None, metadata={'file_path': '/home/jrosh/Projects/whispertest/txt_files/test_audio_shell_lecture.txt', 'file_name': 'test_audio_shell_lecture.txt', 'file_type': 'text/plain', 'file_size': 45833, 'creation_date': '2024-03-21', 'last_modified_date': '2024-03-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=" All right, everyone. Thanks for coming in. This is the missing semester of your CS education. At least that's what we chose to call the class. If you're not here for this class, then you're in the wrong room. We will be here for about an hour, just to set your expectations. And I want to talk to you a little bit first about why we're doing this class. So this class stems out of an obs

In [40]:
text_chunks = []
# track each chunk's source document index so its metadata can be copied onto the node later
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

In [41]:
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    print(node)
    nodes.append(node)

Node ID: 14f41cf9-be80-49ba-868a-7f37b3e38d16
Text: All right, everyone. Thanks for coming in. This is the missing
semester of your CS education. At least that's what we chose to call
the class. If you're not here for this class, then you're in the wrong
room. We will be here for about an hour, just to set your
expectations. And I want to talk to you a little bit first about why
we're doing this ...
Node ID: 8101c9f5-8442-4091-8b76-00e60c602825
Text: So take advantage of the fact that we're here. This class is
going to, I don't want to say ramp up quickly, but what's going to
happen over the course of this particular lecture is it will cover
many of the basics that we assume that you will know for the rest of
the semester, things like how to use your shell and your terminal. And
I'll explain...
Node ID: 460abd75-0511-4b31-8820-d911e93d2746
Text: Usually, it'll be something like executing programs with
arguments. What does that look like? Well, one program we can execute
is the date pro

In [42]:
# sentence transformers
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


In [44]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
    print(node_embedding)


[-0.060712628066539764, -0.02310643345117569, 0.011446606367826462, -0.0703437477350235, 0.04156683385372162, -0.031214579939842224, -0.031966645270586014, 0.004196931608021259, -0.008774680085480213, -0.051531944423913956, 0.04007921740412712, 0.010612113401293755, 0.013570002280175686, -0.028641855344176292, 0.06293020397424698, 0.012507150880992413, 0.0074752443470060825, -0.0321357399225235, -0.01604590378701687, 0.0011007589055225253, 0.025750016793608665, -0.012666643597185612, -0.007408031262457371, -0.03456125035881996, 0.002670475048944354, 0.02272435836493969, 0.08327215164899826, -0.07686217129230499, -0.005974041763693094, -0.18700295686721802, 0.013395211659371853, -0.014004806987941265, 0.065121129155159, -0.011722243390977383, 0.01375389564782381, -0.027596743777394295, 0.04479644075036049, -0.01723567768931389, -0.09018917381763458, 0.01304884348064661, 0.00887093972414732, -0.003886748105287552, 0.02116154134273529, -0.0008597108535468578, 0.035171810537576675, -0.0567

In [63]:
query_str = "Bash"
query_embedding = embed_model.get_query_embedding(query_str)

In [64]:
query_embedding

[-0.0655275285243988,
 0.00894299615174532,
 -0.02082711085677147,
 -0.011000980623066425,
 -0.003246152773499489,
 -0.05166742578148842,
 -0.01613280549645424,
 0.0027699004858732224,
 0.027013229206204414,
 -0.014961343258619308,
 -0.01613481156527996,
 0.020242992788553238,
 0.055608779191970825,
 -0.038164157420396805,
 0.02757468819618225,
 0.006564107723534107,
 0.009144391864538193,
 0.10287594795227051,
 -0.014657911844551563,
 0.03872257098555565,
 0.0039942641742527485,
 0.01933012530207634,
 0.015650464221835136,
 -0.005154171027243137,
 -0.035742755979299545,
 0.054819319397211075,
 0.008804966695606709,
 -0.021953899413347244,
 -0.04366825893521309,
 -0.16800935566425323,
 0.03506435826420784,
 0.01266274694353342,
 0.020349537953734398,
 0.0025566352996975183,
 0.05346674472093582,
 0.061503782868385315,
 0.019158581271767616,
 0.04917735606431961,
 0.01930895447731018,
 0.02721797302365303,
 0.06182744726538658,
 -0.027977142482995987,
 -0.015282899141311646,
 -0.0004241

In [71]:
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=10, mode=query_mode
)
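
Conceptually, a dense top-k query ranks the stored vectors by similarity to the query embedding and keeps the best `k`. The sketch below shows that idea in plain Python, assuming cosine similarity as the metric; the names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical, and the real ranking is done inside PostgreSQL by pgvector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), node_id) for node_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(node_id, score) for score, node_id in scored[:k]]


# Toy 3-dimensional vectors standing in for the 384-dimensional embeddings.
toy_store = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.6, 0.8, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
```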

In [72]:
vector_store_query

VectorStoreQuery(query_embedding=[-0.0655275285243988, 0.00894299615174532, -0.02082711085677147, -0.011000980623066425, -0.003246152773499489, -0.05166742578148842, -0.01613280549645424, 0.0027699004858732224, 0.027013229206204414, -0.014961343258619308, -0.01613481156527996, 0.020242992788553238, 0.055608779191970825, -0.038164157420396805, 0.02757468819618225, 0.006564107723534107, 0.009144391864538193, 0.10287594795227051, -0.014657911844551563, 0.03872257098555565, 0.0039942641742527485, 0.01933012530207634, 0.015650464221835136, -0.005154171027243137, -0.035742755979299545, 0.054819319397211075, 0.008804966695606709, -0.021953899413347244, -0.04366825893521309, -0.16800935566425323, 0.03506435826420784, 0.01266274694353342, 0.020349537953734398, 0.0025566352996975183, 0.05346674472093582, 0.061503782868385315, 0.019158581271767616, 0.04917735606431961, 0.01930895447731018, 0.02721797302365303, 0.06182744726538658, -0.027977142482995987, -0.015282899141311646, -0.00042416309588588

In [73]:
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
print(query_result)

VectorStoreQueryResult(nodes=[], similarities=[], ids=[])
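
One plausible reason for the empty result above: none of the cells shown writes the embedded nodes to the table, so the query has nothing to match against. In LlamaIndex the usual step is `vector_store.add(nodes)` before querying. The sketch below illustrates the effect with a hypothetical in-memory `TinyVectorStore` (an assumption for illustration, not `PGVectorStore`), ranking by dot product.

```python
from dataclasses import dataclass, field


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector store."""

    rows: dict = field(default_factory=dict)

    def add(self, items: list) -> list:
        """Store (node_id, embedding) pairs and return the ids written."""
        for node_id, embedding in items:
            self.rows[node_id] = embedding
        return [node_id for node_id, _ in items]

    def query(self, query_embedding: list, top_k: int = 10) -> list:
        """Return up to top_k node ids ranked by dot product with the query."""
        scored = sorted(
            self.rows.items(),
            key=lambda row: sum(q * v for q, v in zip(query_embedding, row[1])),
            reverse=True,
        )
        return [node_id for node_id, _ in scored[:top_k]]


store = TinyVectorStore()
before = store.query([1.0, 0.0])  # nothing stored yet, so nothing can match
store.add([("node-a", [0.9, 0.1]), ("node-b", [0.1, 0.9])])
after = store.query([1.0, 0.0])  # now both rows are ranked
print(before, after)
```

If the same pattern holds for the notebook, adding the nodes right after the embedding loop and re-running the query would be the first thing to try before concluding that the documents are too small.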
