
Introduction
We began our series on Generative AI with Retrieval Augmented Generation (RAG), focusing on its applications in the energy sector.
In our first post, we explored Tavily Search as a web search tool integrated with LangChain.
While RAG is a powerful tool on its own, more complex agentic workflows combine it with other agents and tools, depending on the needs of the task.

In this blog post we continue working with the Energy dataset from PJM Interconnection LLC, a regional transmission organization (RTO) in the United States serving all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.
We have scaled up the experiments to cover the files:
AEP_hourly.csv
COMED_hourly.csv
DAYTON_hourly.csv
DEOK_hourly.csv
DOM_hourly.csv
DUQ_hourly.csv
EKPC_hourly.csv
FE_hourly.csv
NI_hourly.csv
PJME_hourly.csv
PJMW_hourly.csv
Agents
Our dataset consists of datetime and consumption columns, making it ideal for an agentic workflow where LLMs generate code for analytics tasks.
In our previous RAG post, we observed that LLMs naturally generate Python code when given structured data.
To enhance control, we explicitly define system prompts with:
- Clear data structure definitions
- Rules for formatting responses as code
These prompts are then combined with user queries, allowing users to interact with the system in natural language—without any coding knowledge.
Our first version of this system is illustrated in Figure 1.
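To make this concrete, here is a minimal sketch of such a system prompt, assuming the data is exposed to the generated code as a pandas DataFrame named df with columns Datetime and Consumption (the exact prompt we use differs in wording):

```python
# Hypothetical system prompt: only schema details and formatting rules, no raw data.
SYSTEM_PROMPT = """You are a data analyst who answers by writing Python code.

The data is already loaded as a pandas DataFrame named `df` with columns:
- Datetime (datetime64): hourly timestamp
- Consumption (float): energy consumption in MW

Rules:
- Reply with a single Python code block and nothing else.
- Store the final answer in a variable named `result`.
- Do not read files or access the network.
"""

def build_messages(user_query: str) -> list[dict]:
    """Combine the fixed system prompt with a natural-language user query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
```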

Code Processing
As shown in Figure 1, the process begins when a user submits a query. The LLM interprets the request and returns an answer—often containing code.
We then:
- Extract the generated code
- Pass it to an execution module
- Feed in our dataset as input
- Return the final computed result
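A hedged sketch of this loop, assuming the LLM fences its answer as a Python code block and, per our prompt rules, leaves its output in a variable named result (both conventions are ours):

```python
import re

import pandas as pd

FENCE = "`" * 3  # the three-backtick fence, built indirectly so it cannot break this block

def extract_code(answer: str) -> str:
    """Pull the first fenced Python block out of an LLM answer."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, answer, re.DOTALL)
    if match is None:
        raise ValueError("No code block found in the LLM answer")
    return match.group(1)

def run_generated_code(answer: str, df: pd.DataFrame):
    """Execute the extracted code with the dataset as its only input."""
    namespace = {"df": df}  # feed in our dataset
    exec(extract_code(answer), namespace)  # sandbox this in production
    return namespace["result"]  # convention enforced by the system prompt
```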
Using Python & SQL for Querying
To simplify querying, we leverage DuckDB, which allows us to use SQL directly on local datasets.
In Figure 2, we show an example where GPT-4o generates an SQL query and its corresponding result.
User Query:
“Find the 7 highest peaks of consumption between December 1 and December 10, 2015.”

The generated SQL query is extracted, executed with DuckDB, and the results are returned.
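For illustration, a query of this shape can be run directly against the raw CSV with DuckDB; the column names Datetime and AEP_MW follow the AEP_hourly.csv file, and the SQL the model actually generated may differ:

```python
import duckdb

# Top 7 hourly readings between December 1 and December 10, 2015 (inclusive).
sql = """
SELECT Datetime, AEP_MW
FROM read_csv_auto('AEP_hourly.csv')
WHERE Datetime >= '2015-12-01' AND Datetime < '2015-12-11'
ORDER BY AEP_MW DESC
LIMIT 7
"""
print(duckdb.sql(sql).df())  # the seven highest consumption readings
```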
While we find SQL-based querying effective, our focus in this blog post remains on Python-based coding agents.
Data Privacy
One major advantage of our agentic pipeline is data privacy. Unlike traditional approaches where raw data is sent to an LLM for processing, our system ensures that:
- LLMs never receive actual data → Instead of passing the dataset, we only provide the schema details (e.g., column names and data types).
- No sensitive information leaves our system → The LLM generates code, which is then executed locally within a controlled environment.
- Custom column names can be used → Even if privacy policies prevent sharing real column names, we can map generic names (e.g., Column_A) to their real counterparts.
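A minimal sketch of this schema-only, alias-based setup (the mapping and all names are illustrative):

```python
import pandas as pd

# Privacy-safe aliases: the LLM only ever sees the generic names (real -> generic).
COLUMN_MAP = {"Datetime": "Column_A", "AEP_MW": "Column_B"}

df = pd.read_csv("AEP_hourly.csv")

# Schema description sent to the LLM: generic names and dtypes, never rows.
schema_for_prompt = ", ".join(
    f"{generic} ({df[real].dtype})" for real, generic in COLUMN_MAP.items()
)

# Generated code refers to the generic names, so rename before executing it.
df_for_execution = df.rename(columns=COLUMN_MAP)
```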
By following this approach, organizations can leverage LLMs without violating privacy regulations such as the GDPR and the EU AI Act.
Experiments
We conducted a simple experiment by asking the following questions:
- Find the highest/lowest monthly mean consumption for a given year (YYYY).
- Find the highest/lowest monthly standard deviation of consumption for a given year (YYYY).
These questions were applied to each year (YYYY) in all datasets listed above.
While some datasets span 2004–2018, others have shorter time frames.
Each question was appended to our system prompt, and we reset the experiment before every iteration to simulate a cold start scenario.
Future improvements could include caching responses or adding memory to optimize the pipeline.
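For reference, a correct answer to the first question looks like the following, assuming the schema from our system prompt (the year is hard-coded here in place of YYYY, and the load/rename step is illustrative; in the pipeline, df is provided by the executor):

```python
import pandas as pd

# Illustrative load; the executor normally injects `df` directly.
df = pd.read_csv("AEP_hourly.csv").rename(columns={"AEP_MW": "Consumption"})

year = 2015  # placeholder for YYYY
df["Datetime"] = pd.to_datetime(df["Datetime"])
sub = df[df["Datetime"].dt.year == year]

# Mean consumption per calendar month, then the highest of those means.
monthly_mean = sub.groupby(sub["Datetime"].dt.month)["Consumption"].mean()
result = monthly_mean.max()  # .idxmax() gives the month; .min() gives the lowest
```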
Language Models (LMs)

To evaluate the coding capabilities of LLMs, we tested how correctly and consistently each model produced code while following our instructions. The models were divided into two categories:
Closed-Source Models
These models were accessed via APIs using LangChain, though other frameworks could also be used:
GPT-4o
GPT-4o-mini
Mistral-Large-Latest
Open-Weight Models
These models were run locally using Ollama, which imposed computational constraints:
LLaMA 3.1 8B
CodeLlama 7B
CodeLlama 13B
Mistral v3 7B
Phi-4 14B
DeepSeek-R1 8B
DeepSeek-R1 14B (incomplete results)
Among these, DeepSeek-R1 8B performed less accurately than its parent model, LLaMA 3.1 8B, from which it was distilled.
While DeepSeek-R1 14B achieved perfect accuracy in a single test, we abandoned further experiments due to high computational costs.
Similarly, Phi-4 14B performed well but was relatively slow on our local setup.
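For completeness, here is a hedged sketch of how both groups can be driven through the same LangChain chat interface (package names and model identifiers are assumptions about our setup):

```python
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

# Closed-source model behind an API.
gpt4o = ChatOpenAI(model="gpt-4o", temperature=0)

# Open-weight model served locally by Ollama.
llama = ChatOllama(model="llama3.1:8b", temperature=0)

reply = llama.invoke("Write pandas code that computes a monthly mean.")
print(reply.content)
```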
Evaluations
The primary evaluation metric was accuracy—whether the generated code produced the correct numerical result.
To compute accuracy:
1. We manually calculated gold-standard statistics for comparison.
2. Each LLM-generated solution was compared against these ground truths.
3. Accuracy was measured as a percentage of correct results across all dataset years.
- 100% Accuracy → The model produced correct results for all tested years.
- <100% Accuracy → The model either generated incorrect results or produced code with syntax errors, preventing execution.
Each dataset corresponds to a column in Tables 1–4, with accuracy values averaged across all years.
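A minimal sketch of this scoring, assuming gold-standard values and LLM results are keyed by (dataset, year) and failed executions are stored as None (all names are illustrative):

```python
import math

def accuracy(llm_results: dict, gold: dict, rel_tol: float = 1e-6) -> float:
    """Percentage of (dataset, year) cells where the LLM output matches gold."""
    correct = sum(
        1
        for key, expected in gold.items()
        if llm_results.get(key) is not None
        and math.isclose(llm_results[key], expected, rel_tol=rel_tol)
    )
    return 100.0 * correct / len(gold)
```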
Results
The following models achieved 100% accuracy across all datasets:
✅ GPT-4o
✅ GPT-4o-mini
✅ Mistral-Large-Latest
✅ Phi-4 14B
✅ CodeLlama 13B
For other models, accuracy varied:
- CodeLlama 7B made only one error.
- DeepSeek-R1 8B produced long, unnecessary reasoning and syntax errors, even for simple tasks.
- LLaMA 3.1 8B, the parent of DeepSeek-R1 8B, performed better but still made one error in a dataset.
Challenges with Local Execution
- Running DeepSeek-R1 14B locally was too slow, making large-scale evaluation impractical.
- It may perform better via API or with multi-GPU setups.
A major disappointment was Mistral v3 7B, which consistently underperformed across datasets.

Risk Management and Guardrails
While integrating agentic AI workflows, we must consider several key risks and security challenges:
⚠️ Key Risks
- Data Privacy Risks → Compliance with GDPR and the EU AI Act requires careful handling of sensitive data.
- Bias in AI Models → LLMs may generate biased results, affecting decision-making.
- LLM Hallucinations → Unreliable responses may introduce errors in generated code.
- Security Threats → An unrestricted LLM could execute harmful or malicious commands.
- Inconsistent Outputs → Even at temperature=0, responses may vary across runs.
🛡️ Guardrails & Mitigation Strategies
To minimize these risks, we implement the following safeguards:
✅ Prevent Harmful Code Execution
- Use static analysis to detect dangerous operations before execution (a sketch follows at the end of this section).
- Restrict file system and network access in the execution environment.
✅ Ensure Data Integrity
- Allow only read-only access to datasets.
- Implement a sandboxed execution environment (e.g., using Docker containers).
✅ Enforce Consistency
- Cache frequent queries to improve response stability.
- Use prompt engineering techniques (e.g., Chain-of-Thought reasoning).
✅ Real-Time Monitoring & Logging
- Log all LLM-generated code for auditing.
- Apply anomaly detection to flag unexpected outputs.
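As a sketch of the static-analysis guardrail above, a deliberately simple AST-based check can reject obviously dangerous generated code before it reaches the executor (the blocklists are illustrative, not exhaustive, and static analysis alone is not a substitute for sandboxing):

```python
import ast

# Modules and builtins the generated analytics code has no business using.
FORBIDDEN_IMPORTS = {"os", "sys", "subprocess", "socket", "shutil", "requests"}
FORBIDDEN_CALLS = {"eval", "exec", "open", "__import__"}

def check_code_safety(code: str) -> None:
    """Raise ValueError if the generated code imports or calls anything risky."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        if any(name in FORBIDDEN_IMPORTS for name in names):
            raise ValueError(f"Forbidden import in generated code: {names}")
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FORBIDDEN_CALLS
        ):
            raise ValueError(f"Forbidden call in generated code: {node.func.id}")
```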
By implementing these guardrails, organizations can ensure secure, reliable, and scalable AI-powered workflows.
Conclusion
In this post, we demonstrated how agentic workflows can significantly improve data analytics in the energy sector by:
- ✅ Using code-generating agents to automate data analysis.
- ✅ Leveraging Python & SQL for efficient querying.
- ✅ Ensuring data privacy by never exposing datasets to LLMs.
- ✅ Evaluating multiple LLMs for accuracy & reliability.
Stay tuned & follow us for updates!