Introduction

We began our series on Generative AI with Retrieval Augmented Generation (RAG), focusing on its applications in the energy sector.

In our first post, we explored Tavily Search as a web search tool integrated with LangChain.

While RAG is a powerful tool on its own, more complex agentic workflows combine it with other agents and tools, depending on the specific needs of the task.

In this blog post, we continue working with the Energy dataset from PJM Interconnection LLC, a regional transmission organization (RTO) in the United States serving all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.

We have scaled up the experiments to cover the files:

  • AEP_hourly.csv
  • COMED_hourly.csv
  • DAYTON_hourly.csv
  • DEOK_hourly.csv
  • DOM_hourly.csv
  • DUQ_hourly.csv
  • EKPC_hourly.csv
  • FE_hourly.csv
  • NI_hourly.csv
  • PJME_hourly.csv
  • PJMW_hourly.csv

Agents

Our dataset consists of datetime and consumption columns, making it ideal for an agentic workflow where LLMs generate code for analytics tasks.

In our previous RAG post, we observed that LLMs naturally generate Python code when given structured data.

To enhance control, we explicitly define system prompts with:

  1. Clear data structure definitions
  2. Rules for formatting responses as code

These prompts are then combined with user queries, allowing users to interact with the system in natural language—without any coding knowledge.
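To make this concrete, here is a minimal sketch of such a system prompt and of how it is combined with a user query. The column names, wording, and the `build_prompt` helper are illustrative assumptions, not the exact prompt we used:

```python
# Sketch of a system prompt for the code-generating agent.
# The schema and rules below are illustrative, not our exact production prompt.
SYSTEM_PROMPT = """You are a data analyst. You work with a pandas DataFrame named `df`
with the following schema:
  - Datetime: hourly timestamp of the measurement
  - AEP_MW: energy consumption in megawatts (float)

Rules:
1. Answer with a single Python code block and nothing else.
2. Use only pandas; do not read or write files.
3. Store the final answer in a variable named `result`.
"""

def build_prompt(user_query: str) -> list[dict]:
    """Combine the fixed system prompt with the user's natural-language query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
```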

Our first version of this system is illustrated in Figure 1.

Figure 1: Architecture of Agentic Data Analyst

Code Processing

As shown in Figure 1, the process begins when a user submits a query. The LLM interprets the request and returns an answer—often containing code.

We then:

  1. Extract the generated code
  2. Pass it to an execution module
  3. Feed in our dataset as input
  4. Return the final computed result
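A minimal sketch of these steps is shown below. The helper names `extract_code` and `run_generated_code` are illustrative, and the regular expression assumes the model wraps its answer in a fenced Python code block:

```python
import re
import pandas as pd

def extract_code(llm_answer: str) -> str:
    """Pull the first fenced Python code block out of the LLM response."""
    match = re.search(r"```(?:python)?\n(.*?)```", llm_answer, re.DOTALL)
    if match is None:
        raise ValueError("No code block found in the LLM answer.")
    return match.group(1)

def run_generated_code(code: str, df: pd.DataFrame):
    """Execute the generated code with our dataset as input and return `result`."""
    namespace = {"df": df, "pd": pd}
    exec(code, namespace)  # in practice this runs inside a sandbox; see the guardrails section
    return namespace.get("result")
```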

Using Python & SQL for Querying

To simplify querying, we leverage DuckDB, which allows us to use SQL directly on local datasets.

In Figure 2, we show an example where GPT-4o generates an SQL query and its corresponding result.

User Query:

“Find the 7 highest peaks of consumption between December 1 and December 10, 2015.”

Figure 2: SQL Query with GPT-4o.

The generated SQL query is extracted, executed with DuckDB, and the results are returned.
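As a sketch of how such a query runs locally, DuckDB can execute SQL directly against a pandas DataFrame. The query below is our own reconstruction of the kind of SQL the model produces, and the column names are those of the PJME file:

```python
import duckdb
import pandas as pd

# Assumed columns: Datetime, PJME_MW
df = pd.read_csv("PJME_hourly.csv", parse_dates=["Datetime"])

query = """
SELECT Datetime, PJME_MW
FROM df
WHERE Datetime >= TIMESTAMP '2015-12-01'
  AND Datetime <  TIMESTAMP '2015-12-11'
ORDER BY PJME_MW DESC
LIMIT 7
"""

# DuckDB resolves `df` from the surrounding Python scope.
print(duckdb.query(query).to_df())
```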

While we find SQL-based querying effective, our focus in this blog post remains on Python-based coding agents.

Data Privacy

One major advantage of our agentic pipeline is data privacy. Unlike traditional approaches where raw data is sent to an LLM for processing, our system ensures that:

  • LLMs never receive actual data → Instead of passing the dataset, we only provide the schema details (e.g., column names and data types).
  • No sensitive information leaves our system → The LLM generates code, which is then executed locally within a controlled environment.
  • Custom column names can be used → Even if privacy policies prevent sharing real column names, we can map generic names (e.g., Column_A) to their real counterparts.

By following this approach, organizations can leverage LLMs without violating regulations such as the GDPR and the EU AI Act.
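The sketch below illustrates the idea: only schema information (and, if needed, generic column aliases) is shared with the model, while the rows never leave our environment. The alias mapping is an illustrative assumption:

```python
import pandas as pd

df = pd.read_csv("AEP_hourly.csv")

# Send only schema information to the LLM, never the rows themselves.
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
# e.g. {"Datetime": "object", "AEP_MW": "float64"}

# Optionally hide the real column names behind generic aliases.
aliases = {"Column_A": "Datetime", "Column_B": "AEP_MW"}
schema_for_llm = {alias: schema[real] for alias, real in aliases.items()}

# The LLM writes code against Column_A / Column_B; we rename locally before executing it.
df_for_execution = df.rename(columns={real: alias for alias, real in aliases.items()})
```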

Experiments

We conducted a simple experiment by asking the following questions:

  1. Find the highest/lowest monthly mean consumption for a given year (YYYY).
  2. Find the highest/lowest monthly standard deviation of consumption for a given year (YYYY).

These questions were applied to each year (YYYY) in all datasets listed above.

While some datasets span 2004–2018, others have shorter time frames.

Each question was appended to our system prompt, and we reset the experiment before every iteration to simulate a cold start scenario.
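The evaluation loop can be sketched as follows. `build_prompt`, `extract_code`, and `run_generated_code` are the illustrative helpers from the earlier sketches, `ask_llm` stands in for whichever LangChain or Ollama call is used, and the question templates mirror the two questions above:

```python
import pandas as pd

DATASETS = ["AEP_hourly.csv", "COMED_hourly.csv", "PJME_hourly.csv"]  # and so on
QUESTIONS = [
    "Find the highest monthly mean consumption for the year {year}.",
    "Find the lowest monthly mean consumption for the year {year}.",
    # ... plus the standard-deviation variants
]

results = []
for path in DATASETS:
    df = pd.read_csv(path, parse_dates=["Datetime"])
    for year in sorted(df["Datetime"].dt.year.unique()):
        for template in QUESTIONS:
            # Cold start: a fresh prompt for every question, no memory or caching.
            messages = build_prompt(template.format(year=year))
            answer = ask_llm(messages)  # hypothetical call to the chosen model
            value = run_generated_code(extract_code(answer), df)
            results.append((path, year, template, value))
```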

Future improvements could include caching responses or adding memory to optimize the pipeline.

Language Models (LMs)


To evaluate the coding capabilities of LLMs, we tested each model for correctness, consistency, and how well it followed instructions. The models were divided into two categories:

Closed-Source Models

These models were accessed via APIs using LangChain, though other frameworks could also be used:

  • GPT-4o
  • GPT-4o-mini
  • Mistral-Large-Latest
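As a reference, here is a minimal sketch of how these models can be instantiated through LangChain, assuming the `langchain-openai` and `langchain-mistralai` packages are installed and the corresponding API keys are set in the environment:

```python
from langchain_openai import ChatOpenAI
from langchain_mistralai import ChatMistralAI

# temperature=0 to make the generated code as deterministic as possible
gpt_4o = ChatOpenAI(model="gpt-4o", temperature=0)
gpt_4o_mini = ChatOpenAI(model="gpt-4o-mini", temperature=0)
mistral_large = ChatMistralAI(model="mistral-large-latest", temperature=0)

answer = gpt_4o.invoke("Write pandas code that computes the monthly mean of column AEP_MW.")
print(answer.content)
```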

Open-Weight Models

These models were run locally using Ollama, which imposed computational constraints:

  • LLaMA 3.1 8B
  • CodeLlama 7B
  • CodeLlama 13B
  • Mistral v3 7B
  • Phi-4 14B
  • DeepSeek-R1 8B
  • DeepSeek-R1 14B (incomplete results)
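A minimal sketch of running one of these models locally, assuming the Ollama server is running, the model has been pulled (e.g., `ollama pull llama3.1:8b`), and the `langchain-ollama` package is installed:

```python
from langchain_ollama import ChatOllama

# Requires a local Ollama server and a previously pulled model.
llm = ChatOllama(model="llama3.1:8b", temperature=0)

answer = llm.invoke("Write pandas code that computes the monthly mean of column AEP_MW.")
print(answer.content)
```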

Among these, DeepSeek-R1 8B performed less accurately than LLaMA 3.1 8B, the base model used to create it through distillation.

While DeepSeek-R1 14B achieved perfect accuracy in a single test, we abandoned further experiments due to high computational costs.

Similarly, Phi-4 14B performed well but was relatively slow on our local setup.

Evaluations

The primary evaluation metric was accuracy—whether the generated code produced the correct numerical result.

To compute accuracy:

  1. We manually calculated gold-standard statistics for comparison.
  2. Each LLM-generated solution was compared against these ground truths.
  3. Accuracy was measured as a percentage of correct results across all dataset years.

  • 100% Accuracy → The model produced correct results for all tested years.
  • <100% Accuracy → The model either generated incorrect results or produced code with syntax errors, preventing execution.
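A sketch of this comparison, assuming gold-standard values have been computed beforehand and stored per year; the numerical tolerance and the example numbers are illustrative:

```python
def accuracy(gold: dict[int, float], predicted: dict[int, float], tol: float = 1e-6) -> float:
    """Share of years for which the LLM-generated code returned the correct value.

    Generations that failed or did not execute are simply missing from `predicted`
    and therefore count as incorrect.
    """
    correct = sum(
        1
        for year, gold_value in gold.items()
        if year in predicted and abs(predicted[year] - gold_value) <= tol
    )
    return 100.0 * correct / len(gold)

# Illustrative example: gold-standard highest monthly mean per year vs. model output.
gold = {2015: 16107.3, 2016: 15895.1}
predicted = {2015: 16107.3, 2016: 15230.4}
print(f"Accuracy: {accuracy(gold, predicted):.1f}%")  # -> 50.0%
```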

Each dataset corresponds to a column in Tables 1–4, with accuracy values averaged across all years.

Results

The following models achieved 100% accuracy across all datasets:

  • GPT-4o
  • GPT-4o-mini
  • Mistral-Large-Latest
  • Phi-4 14B
  • CodeLlama 13B
  • LLaMA 3.1 8B

For other models, accuracy varied:

  • CodeLlama 7B made only one error.
  • DeepSeek-R1 8B produced long, unnecessary reasoning and syntax errors, even for simple tasks.
  • LLaMA 3.1 8B, the base model of DeepSeek-R1 8B, performed better but still made one error in a dataset.

Challenges with Local Execution

  • Running DeepSeek-R1 14B locally was too slow, making large-scale evaluation impractical.
  • It may perform better via API or with multi-GPU setups.

A major disappointment was Mistral v3 7B, which consistently underperformed across datasets.

Table 1: Average Accuracy for the highest mean consumption over all datasets.

Table 2: Average Accuracy for the lowest mean consumption over all datasets.

Table 3: Average Accuracy for the highest standard deviation of consumption over all datasets.

Table 4: Average Accuracy for the lowest standard deviation of consumption over all datasets.

Risk Management and Guardrails

While integrating agentic AI workflows, we must consider several key risks and security challenges:

⚠️ Key Risks

  1. Data Privacy Risks → Compliance with GDPR and the EU AI Act requires careful handling of sensitive data.
  2. Bias in AI Models → LLMs may generate biased results, affecting decision-making.
  3. LLM Hallucinations → Unreliable responses may introduce errors in generated code.
  4. Security Threats → An unrestricted LLM could execute harmful or malicious commands.
  5. Inconsistent Outputs → Even at temperature=0, responses may vary across runs.

🛡️ Guardrails & Mitigation Strategies

To minimize these risks, we implement the following safeguards:

Prevent Harmful Code Execution

  • Use static analysis to detect dangerous operations before execution.
  • Restrict file system and network access in the execution environment.
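As a sketch, a lightweight static check can walk the AST of the generated code and reject obviously dangerous constructs before anything runs. The blocklists below are illustrative, not exhaustive:

```python
import ast

# Illustrative blocklists; a real deployment needs a more complete policy.
FORBIDDEN_IMPORTS = {"os", "subprocess", "socket", "shutil", "requests"}
FORBIDDEN_CALLS = {"eval", "exec", "open", "__import__"}

def is_code_safe(code: str) -> bool:
    """Reject generated code that imports or calls anything on the blocklists."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name.split(".")[0] for alias in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module.split(".")[0])
            if any(name in FORBIDDEN_IMPORTS for name in names):
                return False
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return False
    return True
```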

Ensure Data Integrity

  • Allow only read-only access to datasets.
  • Implement a sandboxed execution environment (e.g., using Docker containers).
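One way to sketch such a sandbox is to run the generated script in a throwaway Docker container with the dataset mounted read-only and the network disabled. The image name and paths are illustrative assumptions:

```python
import subprocess

def run_in_sandbox(script_path: str, data_dir: str) -> str:
    """Execute a generated script in an isolated container: no network, read-only data."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                       # no outbound access
        "-v", f"{data_dir}:/data:ro",              # dataset mounted read-only
        "-v", f"{script_path}:/app/script.py:ro",
        "energy-agent-sandbox:latest",             # illustrative image with pandas preinstalled
        "python", "/app/script.py",
    ]
    completed = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return completed.stdout
```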

Enforce Consistency

  • Cache frequent queries to improve response stability.
  • Use prompt engineering techniques (e.g., Chain-of-Thought reasoning).

Real-Time Monitoring & Logging

  • Log all LLM-generated code for auditing.
  • Apply anomaly detection to detect unexpected outputs.

By implementing these guardrails, organizations can ensure secure, reliable, and scalable AI-powered workflows.

Conclusion

In this post, we demonstrated how agentic workflows can significantly improve data analytics in the energy sector by:

  • Using code-generating agents to automate data analysis.
  • Leveraging Python & SQL for efficient querying.
  • Ensuring data privacy by never exposing datasets to LLMs.
  • Evaluating multiple LLMs for accuracy & reliability.

Stay tuned & follow us for updates!

Written by Gilberto Batres-Estrada
Senior Data Scientist and AI engineer at Helicon

Want to know more about how we can work together and launch a successful digital energy service?