How to Build an Atomic-Agents RAG Pipeline with Typed Schemas, Dynamic Context Injection, and Agent Chaining



In this tutorial, we build an advanced, end-to-end learning pipeline around Atomic-Agents by wiring together typed agent interfaces, structured prompting, and a compact retrieval layer that grounds outputs in real project documentation. We then demonstrate how to plan retrieval, fetch relevant context, inject it dynamically into an answering agent, and run an interactive loop that turns the setup into a reusable research assistant for any new Atomic Agents question. Check out the FULL CODES here.

import os, sys, textwrap, time, json, re
from typing import List, Optional, Dict, Tuple
from dataclasses import dataclass
import subprocess

subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                       "atomic-agents", "instructor", "openai", "pydantic",
                       "requests", "beautifulsoup4", "scikit-learn"])

from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (input hidden): ").strip()

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")

from pydantic import Field
from openai import OpenAI
import instructor
from atomic_agents import AtomicAgent, AgentConfig, BaseIOSchema
from atomic_agents.context import SystemPromptGenerator, ChatHistory, BaseDynamicContextProvider
import requests
from bs4 import BeautifulSoup

We install all required packages, import the core Atomic-Agents primitives, and set up Colab-compatible dependencies in one place. We securely capture the OpenAI API key from the keyboard and store it in the environment so downstream code never hardcodes secrets. We also lock in a default model name while keeping it configurable via an environment variable.
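Before wiring in the real library, the plan → retrieve → answer chaining described above can be sketched with plain dataclasses. This is a stdlib-only illustration of the typed-schema pattern, not the Atomic-Agents API; the schema and function names (`PlanOutput`, `AnswerOutput`, `plan_agent`, `answer_agent`) are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical typed schemas mirroring the BaseIOSchema idea (names are illustrative).
@dataclass
class PlanOutput:
    queries: List[str]  # retrieval queries derived from the user question

@dataclass
class AnswerOutput:
    answer: str
    citations: List[str]

def plan_agent(question: str) -> PlanOutput:
    # A real planner would call an LLM; here we reuse the question verbatim.
    return PlanOutput(queries=[question])

def answer_agent(question: str, context: List[str]) -> AnswerOutput:
    # A real answerer would ground an LLM call in the injected context.
    joined = " | ".join(context)
    return AnswerOutput(answer=f"Q: {question} / context: {joined}",
                        citations=[f"chunk-{i}" for i in range(len(context))])

# Chaining: the planner's typed output feeds retrieval, whose output feeds the answerer.
question = "What is an AtomicAgent?"
plan = plan_agent(question)
context = [f"snippet for '{q}'" for q in plan.queries]  # stand-in for real retrieval
result = answer_agent(question, context)
```

The point of the typed boundary is that each stage consumes and produces a schema, so stages can be swapped or tested in isolation.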

def fetch_url_text(url: str, timeout: int = 20) -> str:
    r = requests.get(url, timeout=timeout, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "noscript"]):
        tag.decompose()
    text = soup.get_text("\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text

def chunk_text(text: str, max_chars: int = 1400, overlap: int = 200) -> List[str]:
    if not text:
        return []
    chunks = []
    i = 0
    while i < len(text):
        chunk = text[i:i + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        i += max_chars - overlap
    return chunks

def clamp(s: str, n: int = 800) -> str:
    s = (s or "").strip()
    return s if len(s) <= n else s[:n].rstrip() + "…"

We fetch web pages from the Atomic Agents repo and docs, then clean them into plain text so retrieval becomes reliable. We chunk long documents into overlapping segments, preserving context while keeping each chunk small enough for ranking and citation. We also add a small helper to clamp long snippets so our injected context stays readable.
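To make the overlap arithmetic concrete, here is the same sliding-window chunker (duplicated so the snippet is self-contained) applied to a synthetic 3,000-character string: the window advances by `max_chars - overlap = 1200` characters, so consecutive chunks share their last and first 200 characters.

```python
from typing import List

def chunk_text(text: str, max_chars: int = 1400, overlap: int = 200) -> List[str]:
    # Sliding window: each chunk starts (max_chars - overlap) characters
    # after the previous one, so adjacent chunks share `overlap` characters.
    if not text:
        return []
    chunks = []
    i = 0
    while i < len(text):
        chunk = text[i:i + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        i += max_chars - overlap
    return chunks

# 3,000 varied characters: windows start at 0, 1200, and 2400.
text = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(text, max_chars=1400, overlap=200)
# Three chunks of lengths 1400, 1400, and 600; the tail of each chunk
# repeats the head of the next one.
```

Note that `.strip()` can shorten a chunk that begins or ends with whitespace, so real documents may show slightly smaller overlaps than the nominal 200 characters.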

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@dataclass
class Snippet:
    doc_id: str
    url: str
    chunk_id: int
    text: str
    score: float

class MiniCorpusRetriever:
    def __init__(self, docs: Dict[str, Tuple[str, str]]):
        self.items: List[Tuple[str, str, int, str]] = []
        for doc_id, (url, raw) in docs.items():
            for idx, ch in enumerate(chunk_text(raw)):
                self.items.append((doc_id, url, idx, ch))
        if not self.items:
            raise RuntimeError("No documents were fetched; cannot build TF-IDF index.")
        self.vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
        self.matrix = self.vectorizer.fit_transform([it[3] for it in self.items])

    def search(self, query: str, k: int = 6) -> List[Snippet]:
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.matrix).ravel()
        top = sims.argsort()[::-1][:k]
        out = []
        for j in top:
            doc_id, url, chunk_id, txt = self.items[j]
            out.append(Snippet(doc_id=doc_id, url=url, chunk_id=chunk_id, text=txt, score=float(sims[j])))
        return out

class RetrievedContextProvider(BaseDynamicContextProvider):
    def __init__(self, title: str, snippets: List[Snippet]):
        super().__init__(title=title)
        self.snippets = snippets

    def get_info(self) -> str:
        blocks = []
        for s in self.snippets:
            blocks.append(
                f"[{s.doc_id}#{s.chunk_id}] (score={s.score:.3f}) {s.url}\n{clamp(s.text, 900)}"
            )
        return "\n\n".join(blocks)

We build a mini retrieval system using TF-IDF and cosine similarity over the chunked documentation corpus. We wrap each retrieved chunk in a structured Snippet object to track doc IDs, chunk IDs, and citation scores. We then inject top-ranked chunks into the agent’s runtime via a dynamic context provider, keeping the answering agent grounded.
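The ranking idea behind `MiniCorpusRetriever` can be shown without scikit-learn. The stdlib-only sketch below hand-rolls smoothed TF-IDF weights and cosine similarity; the class name `TinyTfidfRetriever` and its exact weighting are assumptions for illustration, not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter
from typing import Dict, List

class TinyTfidfRetriever:
    """Stdlib stand-in for the TfidfVectorizer + cosine_similarity index above."""

    def __init__(self, docs: List[str]):
        tokenized = [d.lower().split() for d in docs]
        self.n = len(docs)
        # Document frequency: in how many docs each term appears.
        self.df = Counter(t for toks in tokenized for t in set(toks))
        self.vecs = [self._vec(toks) for toks in tokenized]

    def _idf(self, term: str) -> float:
        # Smoothed IDF so unseen query terms don't divide by zero.
        return math.log((1 + self.n) / (1 + self.df.get(term, 0))) + 1.0

    def _vec(self, tokens: List[str]) -> Dict[str, float]:
        tf = Counter(tokens)
        return {t: c * self._idf(t) for t, c in tf.items()}

    @staticmethod
    def _cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(v * b.get(t, 0.0) for t, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query: str, k: int = 3) -> List[int]:
        # Returns indices of the top-k documents by cosine similarity.
        qv = self._vec(query.lower().split())
        ranked = sorted(range(self.n),
                        key=lambda i: self._cosine(qv, self.vecs[i]),
                        reverse=True)
        return ranked[:k]

retriever = TinyTfidfRetriever([
    "atomic agents typed schemas context providers",
    "pasta recipe tomato sauce basil",
])
top = retriever.search("typed schemas for agents")[0]
```

Queries sharing vocabulary with a document score above zero, while unrelated documents score exactly zero, which is why the on-topic chunk always ranks first here.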

For the full code, please refer to the link provided.
