Introduction

In our last blog post, we introduced Retrieval-Augmented Generation (RAG) as a solution to Large Language Model (LLM) hallucinations. RAG enables LLMs to access external knowledge and deliver more up-to-date, reliable answers.

This post focuses on applying RAG in the energy sector. For this, we used the Hourly Energy Consumption dataset published on Kaggle.

The dataset was produced by PJM Interconnection LLC, a regional transmission organization (RTO) in the United States serving all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.

We show a plot of the dataset below; it shows the common patterns of power consumption, with both seasonal and time-of-day variations.

from data_science_blog.utils.utils import get_repo_root
from data_science_blog.utils import langchain_utils as lu
import pandas as pd

# Load the hourly consumption data and index it by timestamp.
PATH = f"{get_repo_root()}/data/HourlyEnergyConsumption/AEP_hourly.csv"
df = pd.read_csv(PATH)
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.set_index('Datetime')
df = df.sort_index()
display(df.head(2))

# Plot the series from 2016 onwards to show the seasonal and daily patterns.
df['2016-01-01':].plot();

Datetime            | AEP_MW
2004-10-01 01:00:00 | 12379.0
2004-10-01 02:00:00 | 11935.0

Disclaimer

Due to space constraints, this blog post does not include all code snippets necessary to reproduce our results. However, we hope it inspires you to develop your own RAG pipeline.

Goals

Our goal is to enable LLMs to answer simple quantitative questions about energy consumption data. We simulate a scenario where an employee at PJM interacts with a RAG system to extract insights, similar to the system proposed in (Lewis 2021).

RAG: image from the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The image represents a query encoder + document index combined with a pre-trained sequence-to-sequence model (seq2seq), called a Generator, fine-tuned end-to-end (Lewis 2021).

Initial Experiments

We selected the AEP_hourly dataset, which contains consumption data (in MW) for American Electric Power.

To start, we ask our LLM the following question:

Please find the highest monthly mean consumption for the year 2018. Please be as specific as possible.

We noted that by passing our file to GPT-4o in the web app, the problem is solved and we get the right answer. One reason is that ChatGPT has the ability to use tools and can therefore run code.

Vanilla LLMs

In our tests we refer to Vanilla LLMs when we use a language model (LM) out of the box, without any changes to the system. We tested both GPT-4o and Llama 3.1, from OpenAI and Meta AI respectively.

As a first test, we ask the above question to both GPT-4o and Llama 3.1. We present only the answer provided by GPT-4o; a very similar response was provided by Llama 3.1. The response is shown in the window below.

Qualitative Analysis of LLM Outputs

GPT-4o returns a complete Python-based answer but cannot compute or return a simple quantitative result. One reason is that we have not provided the system with a code interpreter to run the code. Llama 3.1 (8B parameters) returns a similar solution to GPT-4o; in this case both models behave very similarly.

Data Processing

After running many experiments, we came to the conclusion that we either treat our RAG system as an agent that can make use of tools and run its own code, or we help the system with preprocessing. We chose the second option and computed monthly statistics on the data. In this case we both know the right answer and we can ask well-defined questions.

There is also a benefit to doing the computations beforehand and only letting the system retrieve documents and answer questions, rather than running code, which consumes resources. There is a trade-off between the cost of tokens produced by the LLM running its own code and the cost of doing the computations beforehand; one needs to experiment to see which setup works best. At the same time, we lose flexibility with precomputation, as we assume we know which questions will be asked in the future.

We created a function preprocess_csv for computing the mean, standard deviation, minimum, and maximum for every month. We also convert datetime data to text so that the LLM can make use of it. An example is shown below:

df_monthly_stats = lu.preprocess_csv(PATH, period='ME')
display(df_monthly_stats.head(2))
Datetime   | count | mean     | std     | min     | max     | min_timestamp       | min_datetime_text             | max_timestamp       | max_datetime_text             | Week | Month_text | Year
2004-10-31 | 742   | 13947.54 | 1579.27 | 10263.0 | 17003.0 | 2004-10-31 04:00:00 | October 31, 2004 at 04:00 AM  | 2004-10-18 20:00:00 | October 18, 2004 at 08:00 PM  | 44   | October    | 2004
2004-11-30 | 720   | 14830.44 | 1598.46 | 10998.0 | 18388.0 | 2004-11-21 05:00:00 | November 21, 2004 at 05:00 AM | 2004-11-30 19:00:00 | November 30, 2004 at 07:00 PM | 48   | November   | 2004
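
As an illustration, a minimal sketch of what such a preprocessing step could look like is shown below. The actual preprocess_csv in our langchain_utils module may differ in its details.

import pandas as pd

def preprocess_csv_sketch(path: str, period: str = 'ME') -> pd.DataFrame:
    # Illustrative only: compute monthly statistics from the hourly CSV.
    df = pd.read_csv(path, parse_dates=['Datetime']).set_index('Datetime').sort_index()
    value_col = df.columns[0]  # e.g. 'AEP_MW'
    stats = df[value_col].resample(period).agg(['count', 'mean', 'std', 'min', 'max'])
    stats = stats.reset_index()
    # Convert datetimes to text so the LLM can make use of them.
    stats['Month_text'] = stats['Datetime'].dt.strftime('%B')
    stats['Year'] = stats['Datetime'].dt.year
    return stats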

In LangChain, the data passed to the LLM is wrapped in the Document data structure. A sample of the data we used and how we structure it is shown below.

[Document(metadata={'Month': 'October', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='October of year 2004: mean=13947.537735849057, std=1579.2747641323742, min=10263.0, max=17003.0, count=742.'),
 Document(metadata={'Month': 'November', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='November of year 2004: mean=14830.440277777778, std=1598.4624764130738, min=10998.0, max=18388.0, count=720.'),
 Document(metadata={'Month': 'December', 'Year': 2004, 'source_file': 'AEP_hourly.csv'}, page_content='December of year 2004: mean=16737.720430107525, std=1942.5144247883245, min=12517.0, max=22577.0, count=744.'),...]

In this case we end up with a list of Document objects. A Document has metadata and page_content, the text that becomes part of the context given to the LLM. The metadata helps the system retrieve documents that are relevant to our requests.
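
A sketch of how such Document objects can be built from the monthly statistics is shown below; the helper name is hypothetical and the real code in langchain_utils may differ.

import pandas as pd
from langchain_core.documents import Document

def rows_to_documents(stats: pd.DataFrame, source_file: str) -> list[Document]:
    # One Document per month: the statistics as text, plus metadata for retrieval.
    docs = []
    for _, row in stats.iterrows():
        text = (
            f"{row['Month_text']} of year {row['Year']}: "
            f"mean={row['mean']}, std={row['std']}, "
            f"min={row['min']}, max={row['max']}, count={row['count']}."
        )
        metadata = {
            'Month': row['Month_text'],
            'Year': int(row['Year']),
            'source_file': source_file,
        }
        docs.append(Document(metadata=metadata, page_content=text))
    return docs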

RAG-Pipeline

Components in RAG Systems

RAG systems consist of two main components:

  1. Retriever: Fetches relevant documents.
  2. Generator: Produces answers using the retrieved context.
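
To make this concrete, below is a minimal sketch of how the two components can be wired together in LangChain, assuming documents is the list of Document objects shown above. The package paths, model names, and prompt wording are illustrative, not necessarily our exact setup.

from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Retriever: embed the monthly-statistics documents and index them in a vector store.
vectorstore = Chroma.from_documents(documents, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Generator: a chat model that answers using only the retrieved context.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o")

def format_docs(docs):
    return "\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke(
    "Please find the highest monthly mean consumption for the year 2018."
)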

We tested different retrievers (e.g., Ollama vs. OpenAI embeddings) and observed that OpenAI embeddings outperformed the other alternatives for our use case, using the default values for the system.

Key Parameters for Optimization:

It is possible to implement a hyperparameter search, for instance over the following parameters (a configuration sketch follows the list):

  • Embedding model
  • Chunk size & overlap
  • Vectorstore
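
Below is a sketch of how such a search could be organized over the embedding model and vector store. The package paths and model names are assumptions and may need adjusting to your LangChain version.

from itertools import product

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
from langchain_openai import OpenAIEmbeddings

def build_retrievers(documents):
    # Build one retriever per (embedding model, vector store) combination.
    embedding_options = {
        'openai': OpenAIEmbeddings(),
        'ollama': OllamaEmbeddings(model='llama3'),
    }
    vectorstore_options = {'faiss': FAISS, 'chroma': Chroma}

    retrievers = {}
    for (emb_name, emb), (vs_name, vs_cls) in product(
        embedding_options.items(), vectorstore_options.items()
    ):
        store = vs_cls.from_documents(documents, embedding=emb)
        retrievers[f'{emb_name}+{vs_name}'] = store.as_retriever()
    return retrievers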

Evaluation

Our RAG-system can use FAISS or Chroma as the vector store, and Llama 3 (via Ollama) or OpenAIEmbeddings as the embedding model. This means that our retriever alone has 4 different variations, and the number of variations increases as more components are added.

We did a simple selection based on qualitative analysis, comparing the responses we got from the system. We found that, with the default values, the system performs better with Chroma and OpenAIEmbeddings.

As the Generator, we chose between Llama 3.1, GPT-4o, and GPT-4o-mini. We show the results in the tables below. A model's response is marked as Correct if its content matches our gold response and Incorrect otherwise. As we worked with a small amount of data, we could do the evaluation qualitatively.

Responses with LLama 3.1

Datetime      | True Label | RAG-System
December 2004 | 16737.72   | ✅ Correct
December 2005 | 17548.59   | ✅ Correct
August 2006   | 17400.20   | ✅ Correct
February 2007 | 19212.90   | ❌ Incorrect
January 2008  | 18574.76   | ✅ Correct
January 2009  | 18653.02   | ❌ Incorrect
December 2010 | 18389.12   | ✅ Correct
January 2011  | 18314.49   | ✅ Correct
July 2012     | 17299.70   | ✅ Correct
February 2013 | 16946.72   | ✅ Correct
January 2014  | 18449.32   | ✅ Correct
February 2015 | 18407.08   | ✅ Correct
August 2016   | 16765.30   | ✅ Correct
December 2017 | 16094.73   | ✅ Correct
January 2018  | 17594.88   | ✅ Correct

Table 1: Gold vs RAG-system response. Llama 3.1 is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.

Responses with GPT-4o-mini

Datetime      | True Label | RAG-System
December 2004 | 16737.72   | ✅ Correct
December 2005 | 17548.59   | ✅ Correct
August 2006   | 17400.20   | ❌ Incorrect
February 2007 | 19212.90   | ❌ Incorrect
January 2008  | 18574.76   | ❌ Incorrect
January 2009  | 18653.02   | ❌ Incorrect
December 2010 | 18389.12   | ✅ Correct
January 2011  | 18314.49   | ✅ Correct
July 2012     | 17299.70   | ✅ Correct
February 2013 | 16946.72   | ❌ Incorrect
January 2014  | 18449.32   | ✅ Correct
February 2015 | 18407.08   | ❌ Incorrect
August 2016   | 16765.30   | ✅ Correct
December 2017 | 16094.73   | ✅ Correct
January 2018  | 17594.88   | ❌ Incorrect

Table 2: Gold vs RAG-system response. GPT-4o-mini is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.

Responses with GPT-4o

Datetime      | True Label | RAG-System
December 2004 | 16737.72   | ✅ Correct
December 2005 | 17548.59   | ✅ Correct
August 2006   | 17400.20   | ✅ Correct
February 2007 | 19212.90   | ✅ Correct
January 2008  | 18574.76   | ✅ Correct
January 2009  | 18653.02   | ✅ Correct
December 2010 | 18389.12   | ✅ Correct
January 2011  | 18314.49   | ✅ Correct
July 2012     | 17299.70   | ✅ Correct
February 2013 | 16946.72   | ✅ Correct
January 2014  | 18449.32   | ✅ Correct
February 2015 | 18407.08   | ✅ Correct
August 2016   | 16765.30   | ✅ Correct
December 2017 | 16094.73   | ✅ Correct
January 2018  | 17594.88   | ✅ Correct

Table 3: Gold vs RAG-system response. GPT-4o is the generator and the retriever consists of the Chroma + OpenAIEmbeddings combination.

Below we show an example answer from the best system, GPT-4o. We noted that its answers tend to be short and concise.

Because of the random nature of output generation from LLMs, different runs can give slightly different responses. The number of errors varied between 2 and 4 for Llama 3.1. For GPT-4o-mini we got the same number of errors (7), but different questions were answered incorrectly. For GPT-4o we did not see any variation.

We can summarize the results as follows:

  • Table 1 shows the results using Llama 3.1 as the Generator; its error count varied between 2 and 4 out of 15 questions, resulting in an accuracy range of 73.3% - 86.7%.
  • Table 2: GPT-4o-mini as the Generator produced 7 errors, resulting in an accuracy of 53.3%.
  • Table 3: The best combination in our setup was produced by GPT-4o, which scored 100% accuracy and, as mentioned before, gives short and concise answers.

In order to improve a RAG-system it is important to evaluate the different components that make up the system. Below we present a short list of metrics that can be used, followed by a small worked example for the retrieval metrics.

Retrieval Metrics

  • Precision@k: Proportion of relevant documents among the top k retrieved.
  • Recall@k: Proportion of all relevant documents retrieved.
  • F1-Score: Balances precision and recall.
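
As a toy illustration of the first two retrieval metrics (the document ids below are made up):

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    # Precision@k and Recall@k for a single query.
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# 2 of the 3 relevant documents appear in the top 3 retrieved documents.
p, r = precision_recall_at_k(
    ['doc_2004_12', 'doc_2005_12', 'doc_2006_08', 'doc_2007_02'],
    {'doc_2004_12', 'doc_2006_08', 'doc_2010_12'},
    k=3,
)
print(p, r)  # approximately 0.67, 0.67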

Generation Metrics

  • BLEU, ROUGE, or GPTScore, depending on the task complexity
  • Exact Match
  • Reasoning Accuracy

In our case we could perform a qualitative analysis, comparing the gold responses with the outputs from the RAG-system. We also tested an evaluator using GPT-4o, where the evaluator compared the output with the gold responses. In our tests we could verify that our evaluator was correct; a sketch of such an evaluator is shown below.
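
A minimal sketch of an LLM-based evaluator, assuming the langchain_openai package; the prompt wording is illustrative rather than the exact one we used.

from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model='gpt-4o', temperature=0)

def judge_answer(question: str, gold: str, answer: str) -> str:
    # Ask GPT-4o to compare the system answer with the gold answer.
    prompt = (
        'You are grading a RAG system.\n'
        f'Question: {question}\n'
        f'Gold answer: {gold}\n'
        f'System answer: {answer}\n'
        'Reply with exactly one word: Correct or Incorrect.'
    )
    return judge.invoke(prompt).content.strip()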

Conclusion

In our qualitative analysis we found that opting for Chroma + OpenAIEmbeddings gave the best results, but we cannot rule out that other combinations could work as well. As mentioned before, we used the default values that LangChain provides in its API. The hyperparameters can be tuned using hyperparameter search or a similar procedure.

Takeaways

Through our exploration of a RAG pipeline for answering quantitative questions in the energy sector, we’ve identified several key insights:

  1. Preprocessing Data: Preparing and structuring data in advance enhances efficiency and ensures more reliable results.
  2. The Importance of Experimentation: Iterative testing is crucial for optimization. This includes refining embeddings, exploring various chunking strategies, and evaluating different retrievers.
  3. Balancing Trade-offs: Deciding between on-demand computations and precomputed data is a critical design choice. Precomputed data can reduce latency and conserve resources, but it may limit flexibility.

As AI adoption continues to grow, workflows like this will be essential tools for data scientists integrating LLMs into business processes.

References

HuggingFace. 2022. “HuggingFace.” https://huggingface.co/docs/trl/index.

Lewis, Patrick, et al. 2021. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” https://arxiv.org/pdf/2005.11401.

Tunstall, Lewis et al. 2022. Natural Language Processing with Transformers. O’Reilly Media.

Written by Gilberto Batres-Estrada
Senior Data Scientist and AI engineer at Helicon

Want to know more about how we can work together and launch a successful digital energy service?