In this tutorial, we implement an agentic chain-of-thought pruning framework that generates multiple reasoning paths in parallel and dynamically reduces them using consensus signals and early stopping. We focus on improving reasoning efficiency by reducing unnecessary token usage while preserving answer correctness, demonstrating that self-consistency and lightweight graph-based agreement can serve as effective proxies for reasoning quality. We design the entire pipeline using a compact instruction-tuned model and progressive sampling to simulate how an agent can decide when it has reasoned “enough.” Check out the FULL CODES here.
```python
import re, time, random, math
import numpy as np
import torch
import networkx as nx
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
model.eval()

SYSTEM = "You are a careful problem solver. Keep reasoning brief and output a final numeric answer."
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")
```
We set up the Colab environment and load all required libraries for efficient agentic reasoning. We initialize a lightweight instruction-tuned language model with quantization to ensure stable execution on limited GPU resources. We also define global configuration, randomness control, and the core prompting pattern used throughout the tutorial.
```python
def make_prompt(q):
    return (
        f"{SYSTEM}\n\n"
        f"Problem: {q}\n"
        f"Reasoning: (brief)\n"
        f"Final: "
    )

def parse_final_number(text):
    m = FINAL_RE.search(text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def is_correct(pred, gold):
    if pred is None:
        return 0
    try:
        return int(abs(float(pred) - float(gold)) < 1e-9)
    except Exception:
        return int(str(pred).strip() == str(gold).strip())

def tok_len(text):
    return len(tokenizer.encode(text))
```
We define helper functions that structure prompts, extract final numeric answers, and evaluate correctness against ground truth. We standardize how answers are parsed so that different reasoning paths can be compared consistently. We also introduce token-counting utilities that allow us to later measure reasoning efficiency.
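The parsing helpers are easy to sanity-check in isolation. The standalone snippet below mirrors their logic (the regex and functions are re-declared here so it runs on its own, without the model):

```python
import re

# Mirrors the tutorial's helpers: take the number after "Final:",
# falling back to the last number anywhere in the text.
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")

def parse_final_number(text):
    m = FINAL_RE.search(text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def is_correct(pred, gold):
    if pred is None:
        return 0
    try:
        return int(abs(float(pred) - float(gold)) < 1e-9)
    except Exception:
        return int(str(pred).strip() == str(gold).strip())

print(parse_final_number("Reasoning: 6*7=42\nFinal: 42"))  # 42
print(parse_final_number("The answer is 3.5 overall"))     # 3.5
print(is_correct("42", 42))                                # 1
```

Note the fallback path: even when a sampled completion forgets the `Final:` marker, the last number in the text is still recovered, which keeps more paths comparable during pruning.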
```python
def generate_paths(question, n, max_new_tokens=64, temperature=0.7, top_p=0.9):
    prompt = make_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen_cfg = GenerationConfig(
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        num_return_sequences=n,
    )
    out = model.generate(**inputs, generation_config=gen_cfg)
    prompt_tok = inputs["input_ids"].shape[1]
    paths = []
    for i in range(out.shape[0]):
        seq = out[i]
        gen_ids = seq[prompt_tok:]
        completion = tokenizer.decode(gen_ids, skip_special_tokens=True)
        paths.append({
            "prompt_tokens": int(prompt_tok),
            "gen_tokens": int(gen_ids.shape[0]),
            "completion": completion,
        })
    return paths
```
We implement fast multi-sample generation that produces several reasoning paths in a single model call. We extract only the generated continuation to isolate the reasoning output for each path. We store token usage and completions in a structured format to support downstream pruning decisions.
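To see how these per-path records feed a self-consistency decision, here is a minimal sketch: a majority vote over parsed answers plus a token-usage total for the efficiency metric. The `paths` list, the `majority_answer` helper, and the toy `parse` lambda are illustrative stand-ins, not part of the tutorial's code.

```python
from collections import Counter

# Hypothetical records, in the same shape generate_paths() returns.
paths = [
    {"prompt_tokens": 40, "gen_tokens": 22, "completion": "2+3=5\nFinal: 5"},
    {"prompt_tokens": 40, "gen_tokens": 31, "completion": "Sum is five\nFinal: 5"},
    {"prompt_tokens": 40, "gen_tokens": 18, "completion": "Guess\nFinal: 6"},
]

def majority_answer(paths, parse):
    # Vote over parsed final answers; agreement is the winner's vote share.
    votes = Counter(parse(p["completion"]) for p in paths)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(paths)

# Toy parser standing in for parse_final_number.
parse = lambda t: t.rsplit("Final:", 1)[-1].strip()

answer, agreement = majority_answer(paths, parse)
total_gen = sum(p["gen_tokens"] for p in paths)  # efficiency: tokens spent
```

High agreement with a low `total_gen` is exactly the regime the pruning framework aims for: stop sampling once the vote share is convincing, rather than spending a fixed token budget.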
```python
def consensus_strength(completions, sim_threshold=0.3):
    # Signature reconstructed: the function name and the threshold
    # default are assumptions; only the body survives in the original.
    if len(completions) <= 1:
        return [0.0] * len(completions)
    vec = TfidfVectorizer(ngram_range=(1, 2), max_features=2500)
    X = vec.fit_transform(completions)
    S = cosine_similarity(X)
    G = nx.Graph()
    n = len(completions)
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            w = float(S[i, j])
            if w >= sim_threshold:
                G.add_edge(i, j, weight=w)
    strength = [0.0] * n
    for u, v, d in G.edges(data=True):
        w = float(d.get("weight", 0.0))
        strength[u] += w
        strength[v] += w
    return strength
```
We construct a lightweight consensus mechanism using a similarity graph over generated reasoning paths. We compute pairwise similarity scores and convert them into a graph-based strength signal for each path. This allows us to approximate agreement between reasoning trajectories without expensive model calls.