
Introduction
In our last blog post, we introduced Retrieval-Augmented Generation (RAG) as a solution to Large Language Model (LLM) hallucinations. RAG enables LLMs to access external knowledge and deliver more up-to-date, reliable answers.
This post focuses on applying RAG in the Energy Sector. For this, we used the Hourly Energy Consumption dataset, published on Kaggle:
Hourly Energy Consumption Dataset.
The dataset was produced by PJM Interconnection LLC, a regional transmission organization (RTO) in the United States that serves all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.
We show a plot of the dataset below; it illustrates the common patterns of power consumption, with both seasonal and time-of-day variations.
from data_science_blog.utils.utils import get_repo_root
from data_science_blog.utils import langchain_utils as lu
import pandas as pd
PATH = f"{get_repo_root()}/data/HourlyEnergyConsumption/AEP_hourly.csv"df = pd.read_csv(PATH)
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.set_index('Datetime')
df = df.sort_index()
display(df.head(2))
df['2016-01-01':].plot();
| Datetime | AEP_MW |
|---|---|
| 2004-10-01 01:00:00 | 12379.0 |
| 2004-10-01 02:00:00 | 11935.0 |

Disclaimer
Due to space constraints, this blog post does not include all code snippets necessary to reproduce our results. However, we hope it inspires you to develop your own RAG pipeline.
Goals
Our goal is to enable LLMs to answer simple quantitative questions about energy consumption data. We simulate a scenario where an employee at PJM interacts with a RAG system to extract insights, similar to the system proposed in (Lewis 2021).

RAG: Image from the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The image shows a Query Encoder + Document Index combined with a pre-trained sequence-to-sequence (seq2seq) model, called a Generator, fine-tuned end-to-end (Lewis 2021).
Initial Experiments
We selected the AEP_hourly dataset, which contains consumption data (in MW) for American Electric Power.
To start, we ask our LLM the following question:
"Please find the highest monthly mean consumption for the year 2018. Please be as specific as possible."
We noted that by passing our file to GPT-4o in the web app, the problem is solved and we get the right answer. One reason is that ChatGPT can use tools and therefore run code.
Vanilla LLMs
In our tests, we refer to vanilla LLMs when we use a language model (LM) out of the box, without any changes to the system. We tested both GPT-4o and Llama 3.1, from OpenAI and Meta AI respectively.
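A minimal sketch of what such an out-of-the-box call can look like with LangChain's chat wrappers is shown below; the ChatOpenAI and ChatOllama classes and the model identifiers are assumptions about the setup, not the exact code we ran.

from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

QUESTION = (
    "Please find the highest monthly mean consumption for the year 2018. "
    "Please be as specific as possible."
)

# GPT-4o via the OpenAI API, with no tools or code interpreter attached.
gpt4o = ChatOpenAI(model="gpt-4o", temperature=0)
print(gpt4o.invoke(QUESTION).content)

# Llama 3.1 served locally through Ollama.
llama = ChatOllama(model="llama3.1", temperature=0)
print(llama.invoke(QUESTION).content)

Neither call gives the model access to the data or a code interpreter, which is exactly the limitation we observe below.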
As a first test, we ask the above question to both GPT-4o and Llama 3.1. We present only the answer provided by GPT-4o; a very similar response was provided by Llama 3.1. We see the response in the window below.

Qualitative Analysis of LLM Outputs
GPT-4o returns a complete Python-based answer but cannot compute or return a simple quantitative answer. One reason is that we have not provided the system with a code interpreter to run the code. Llama 3.1 (8B parameters) returns a similar solution to GPT-4o; in this case both models behave very similarly.
Data Processing
After running many experiments, we came to the conclusion that either we treat our RAG system as an agent that can make use of tools and run its own code, or we help the system with preprocessing. We chose the second option and computed monthly statistics on the data. In this case we both know the right answer and can ask well-defined questions.
There is also a benefit to doing the computations beforehand and only letting the system retrieve documents and answer questions, rather than run code, which consumes resources. There is a trade-off between the cost of tokens produced by the LLM running its own code and the cost of doing computations beforehand; one needs to experiment to see which setup works best. At the same time, we lose flexibility with precomputation, as we assume we know which questions will be asked in the future.
We created a function preprocess_csv for computing the mean, standard deviation, minimum, and maximum for every month. We also convert datetime data to text so that the LLM can make use of it. An example is shown below:
df_monthly_stats = lu.preprocess_csv(PATH, period='ME')
display(df_monthly_stats.head(2))
| | Datetime | count | mean | std | min | max | min_timestamp | min_datetime_text | max_timestamp | max_datetime_text | Week | Month_text | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-10-31 | 742 | 13947.54 | 1579.27 | 10263.0 | 17003.0 | 2004-10-31 04:00:00 | October 31, 2004 at 04:00 AM | 2004-10-18 20:00:00 | October 18, 2004 at 08:00 PM | 44 | October | 2004 |
| 1 | 2004-11-30 | 720 | 14830.44 | 1598.46 | 10998.0 | 18388.0 | 2004-11-21 05:00:00 | November 21, 2004 at 05:00 AM | 2004-11-30 19:00:00 | November 30, 2004 at 07:00 PM | 48 | November | 2004 |
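We do not reproduce preprocess_csv in full here, but a minimal sketch of the core idea, monthly aggregation with pandas resampling, could look like the following (the function name is hypothetical, the column names follow the table above, and the timestamp-to-text columns are omitted for brevity).

import pandas as pd

def monthly_stats_sketch(path: str, period: str = "ME") -> pd.DataFrame:
    """Simplified stand-in for preprocess_csv: monthly count/mean/std/min/max."""
    df = (
        pd.read_csv(path, parse_dates=["Datetime"])
        .set_index("Datetime")
        .sort_index()
    )
    value_col = df.columns[0]  # e.g. 'AEP_MW'

    # 'ME' = month-end frequency (pandas >= 2.2; use 'M' on older versions).
    stats = (
        df[value_col]
        .resample(period)
        .agg(["count", "mean", "std", "min", "max"])
        .round(2)
        .reset_index()
    )

    # Text columns so the LLM can reason about months and years directly.
    stats["Month_text"] = stats["Datetime"].dt.month_name()
    stats["Year"] = stats["Datetime"].dt.year
    return stats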
In LangChain, the text passed to the LLM as context is wrapped in the Document data structure. A sample of the data we used, and how we structure it, is shown below.
[Document(metadata={'Month': 'October', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='October of year 2004: mean=13947.537735849057, std=1579.2747641323742, min=10263.0, max=17003.0, count=742.'),
Document(metadata={'Month': 'November', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='November of year 2004: mean=14830.440277777778, std=1598.4624764130738, min=10998.0, max=18388.0, count=720.'),
Document(metadata={'Month': 'December', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='December of year 2004: mean=16737.720430107525, std=1942.5144247883245, min=12517.0, max=22577.0, count=744.'),...]
We end up with a list of Document containers. A Document has metadata and page_content, the text that becomes part of the context given to the LLM. The metadata helps the system retrieve documents that are relevant to our requests.
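Building that list from the preprocessed statistics is mostly a formatting exercise; a minimal sketch, assuming the column names shown in the statistics table above, could look like this.

from langchain_core.documents import Document

def stats_to_documents(df_stats, source_file="AEP_hourly.csv"):
    """Turn each monthly-statistics row into a LangChain Document."""
    docs = []
    for _, row in df_stats.iterrows():
        text = (
            f"{row['Month_text']} of year {row['Year']}: "
            f"mean={row['mean']}, std={row['std']}, "
            f"min={row['min']}, max={row['max']}, count={row['count']}."
        )
        metadata = {
            "Month": row["Month_text"],
            "Year": int(row["Year"]),
            "source_file": source_file,
        }
        docs.append(Document(page_content=text, metadata=metadata))
    return docs

docs = stats_to_documents(df_monthly_stats)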
RAG-Pipeline
Components in RAG Systems
RAG systems consist of two main components:
- Retriever: Fetches relevant documents.
- Generator: Produces answers using the retrieved context.
We tested different retrievers (e.g., Ollama vs. OpenAI embeddings) and observed that OpenAI embeddings outperformed the other alternatives for our use case, using the default values for the system.
Key Parameters for Optimization:
It is possible to implement hyperparameter search, for instance by changing:
- Embedding model
- Chunk size & overlap
- Vectorstore
Evaluation
Our RAG system can use FAISS or Chroma as the vector store, and Llama 3 or OpenAIEmbeddings for the embedding part. This means that our retriever alone has four different variations, and the number of variations increases as more components are added.
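A minimal sketch of how the four retriever variations could be assembled is shown below; package and class names such as OllamaEmbeddings may differ slightly between LangChain versions, and the embedding model name is an assumption.

from langchain_openai import OpenAIEmbeddings
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS, Chroma

embeddings = {
    "openai": OpenAIEmbeddings(),
    "llama3": OllamaEmbeddings(model="llama3"),
}
vectorstores = {"faiss": FAISS, "chroma": Chroma}

# Build one retriever per (embedding, vector store) combination: 2 x 2 = 4 variants.
retrievers = {}
for emb_name, emb in embeddings.items():
    for vs_name, vs_cls in vectorstores.items():
        store = vs_cls.from_documents(documents=docs, embedding=emb)
        retrievers[f"{vs_name}+{emb_name}"] = store.as_retriever(search_kwargs={"k": 4})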
We did a simple selection based on qualitative analysis, where we compared the responses we got from the system. We found that, with the default values, the system performs better if we use Chroma and OpenAIEmbeddings.
As a generator, we chose between Llama 3.1, GPT-4o, and GPT-4o-mini. We show the results in the tables below. The models' responses are marked as Correct or Incorrect depending on whether the content matches our gold responses. As we worked with a small amount of data, we could do the evaluation qualitatively.
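A minimal sketch of how a generator is wired to the chosen retriever is shown below, using the retrievers dictionary from the sketch above; the prompt wording, k value, and helper function are illustrative, and we assume a recent LangChain version where retrievers expose invoke.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

retriever = retrievers["chroma+openai"]

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = prompt | llm | StrOutputParser()

def answer(question: str) -> str:
    # Retrieve the most similar monthly-statistics documents and generate an answer.
    context = "\n".join(doc.page_content for doc in retriever.invoke(question))
    return chain.invoke({"context": context, "question": question})

print(answer("What was the mean consumption in January 2018?"))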
Responses with Llama 3.1
| Month | True Label (MW) | RAG-System |
|---|---|---|
| December 2004 | 16737.72 | ✅ Correct |
| December 2005 | 17548.59 | ✅ Correct |
| August 2006 | 17400.20 | ✅ Correct |
| February 2007 | 19212.90 | ❌ Incorrect |
| January 2008 | 18574.76 | ✅ Correct |
| January 2009 | 18653.02 | ❌ Incorrect |
| December 2010 | 18389.12 | ✅ Correct |
| January 2011 | 18314.49 | ✅ Correct |
| July 2012 | 17299.70 | ✅ Correct |
| February 2013 | 16946.72 | ✅ Correct |
| January 2014 | 18449.32 | ✅ Correct |
| February 2015 | 18407.08 | ✅ Correct |
| August 2016 | 16765.30 | ✅ Correct |
| December 2017 | 16094.73 | ✅ Correct |
| January 2018 | 17594.88 | ✅ Correct |
Table 1: Gold vs. RAG-system responses. Llama 3.1 is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.
Responses with GPT-4o-mini
| Month | True Label (MW) | RAG-System |
|---|---|---|
| December 2004 | 16737.72 | ✅ Correct |
| December 2005 | 17548.59 | ✅ Correct |
| August 2006 | 17400.20 | ❌ Incorrect |
| February 2007 | 19212.90 | ❌ Incorrect |
| January 2008 | 18574.76 | ❌ Incorrect |
| January 2009 | 18653.02 | ❌ Incorrect |
| December 2010 | 18389.12 | ✅ Correct |
| January 2011 | 18314.49 | ✅ Correct |
| July 2012 | 17299.70 | ✅ Correct |
| February 2013 | 16946.72 | ❌ Incorrect |
| January 2014 | 18449.32 | ✅ Correct |
| February 2015 | 18407.08 | ❌ Incorrect |
| August 2016 | 16765.30 | ✅ Correct |
| December 2017 | 16094.73 | ✅ Correct |
| January 2018 | 17594.88 | ❌ Incorrect |
Table 2: Gold vs. RAG-system responses. GPT-4o-mini is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.
Responses with GPT-4o
| Month | True Label (MW) | RAG-System |
|---|---|---|
| December 2004 | 16737.72 | ✅ Correct |
| December 2005 | 17548.59 | ✅ Correct |
| August 2006 | 17400.20 | ✅ Correct |
| February 2007 | 19212.90 | ✅ Correct |
| January 2008 | 18574.76 | ✅ Correct |
| January 2009 | 18653.02 | ✅ Correct |
| December 2010 | 18389.12 | ✅ Correct |
| January 2011 | 18314.49 | ✅ Correct |
| July 2012 | 17299.70 | ✅ Correct |
| February 2013 | 16946.72 | ✅ Correct |
| January 2014 | 18449.32 | ✅ Correct |
| February 2015 | 18407.08 | ✅ Correct |
| August 2016 | 16765.30 | ✅ Correct |
| December 2017 | 16094.73 | ✅ Correct |
| January 2018 | 17594.88 | ✅ Correct |
Table 3: Gold vs. RAG-system responses. GPT-4o is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.
Below we show an example answer from the best system, GPT-4o, and note that its answers tend to be short and concise.

Because of the random nature of LLM output generation, different runs can give slightly different responses. The error rate varied between 2 and 4 errors for Llama 3.1. For GPT-4o-mini we got the same number of errors (7) across runs, but different questions were answered incorrectly. For GPT-4o we did not see any variation.
We can summarize the results as follows:
- Table 1 shows the results using Llama 3.1 as the generator; its accuracy varied between 2 and 4 errors out of 15 questions, giving an accuracy range of 73.3% - 86.7%.
- Table 2: GPT-4o-mini as the generator produces 7 errors, resulting in an accuracy of 53.3%.
- Table 3: the best combination in our setup was GPT-4o, which scored 100% accuracy and, as mentioned before, answers in a short and concise way.
In order to improve a RAG-system it is important to evaluate the different components that make up the system. Below we present a short list of metrics that can be used.
Retrieval Metrics
- Precision@k: Proportion of relevant documents among the top k retrieved.
- Recall@k: Proportion of all relevant documents retrieved.
- F1-Score: Balances precision and recall.
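With binary relevance judgments, the first two metrics reduce to simple counting; a minimal, hypothetical sketch (not part of our pipeline) is shown below.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

# Example: for a question about January 2018, only one document is relevant.
retrieved = ["2018-01", "2017-12", "2018-02", "2017-01"]
relevant = {"2018-01"}
print(precision_at_k(retrieved, relevant, k=4))  # 0.25
print(recall_at_k(retrieved, relevant, k=4))     # 1.0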
Generation Metrics
- Evaluating the output quality requires metrics like BLEU, ROUGE, or GPTScore, depending on the task complexity.
- Exact Match
- Reasoning Accuracy
In our case we could perform a qualitative analysis and compare the gold responses with the outputs from the RAG system. We also tested an evaluator using GPT-4o, where the evaluator compared the output with the gold responses. In our tests we could verify that our evaluator was correct.
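A minimal sketch of such an LLM-based evaluator is shown below; the grading prompt and example values are illustrative assumptions, not the exact evaluator we used.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

judge_prompt = ChatPromptTemplate.from_template(
    "You are grading a RAG system.\n"
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "System answer: {answer}\n"
    "Reply with exactly one word: Correct or Incorrect."
)
judge = judge_prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()

verdict = judge.invoke({
    "question": "What was the mean consumption in January 2018?",
    "gold": "17594.88 MW",
    "answer": "The mean consumption in January 2018 was 17594.88 MW.",
})
print(verdict)  # expected: "Correct"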
Conclusion
In our qualitative analysis we found that opting for Chroma + OpenAIEmbeddings gave the best results, but we cannot rule out that other combinations could work as well. As mentioned before, we used the default values that LangChain provides in its API. The hyperparameters can be tuned using hyperparameter search or another type of algorithm.
Takeaways
Through our exploration of a RAG pipeline for answering quantitative questions in the energy sector, we’ve identified several key insights:
- Preprocessing Data: Preparing and structuring data in advance enhances efficiency and ensures more reliable results.
- The Importance of Experimentation: Iterative testing is crucial for optimization. This includes refining embeddings, exploring various chunking strategies, and evaluating different retrievers.
- Balancing Trade-offs: Deciding between on-demand computation and precomputed data is a critical design choice. Precomputed data can reduce latency and conserve resources, but it may limit flexibility.
As AI adoption continues to grow, workflows like this will be essential tools for data scientists integrating LLMs into business processes.
References
HuggingFace. 2022. “HuggingFace.” https://huggingface.co/docs/trl/index.
Lewis, Patrick, et al. 2021. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." https://arxiv.org/pdf/2005.11401.
Tunstall, Lewis et al. 2022. Natural Language Processing with Transformers. O’Reilly Media.