Data for Change

Predicting energy demand with neural networks

How we won a datathon with a simple neural network, backed by extensive literature review, feature engineering and feature selection using XGBoost

Min Htoo Lin
Towards Data Science
Jul 16, 2020 · 24 min read


Written by: Chua Chiah Soon, Li Zhaochen, Lin Min Htoo, Quah Jia Yong, all NTU students from Singapore.

This article summarises the approach we used to win the Deep Learning Datathon jointly organised by Nanyang Technological University Singapore & ai4impact. For any queries, do reach out to us on LinkedIn.

Here is an outline of our article:

  • Introduction
  • Objective & Metrics of success
  • Exploratory Data Analysis (EDA)
  • Data Cleaning
  • Research methodology
  • Converting 1d time to 2d time
  • Windowing
  • Feature Selection: XGBoost SHAP feature importance values
  • Results & Discussion: Features
  • Results & Discussion: Model
  • Our model’s strengths & weaknesses
  • A note on interpretable machine learning
  • Conclusion

Introduction

Matching electrical energy consumption with the right level of supply is crucial, because excess electricity supplied cannot be stored, unless converted to other forms, which incurs additional costs and resources. At the same time, underestimating energy consumption could be fatal, with excess demand overloading the supply line and even causing blackouts. Clearly, there are tangible benefits in closely monitoring the energy consumption of buildings — be they office, commercial or household.

With the advent of machine learning, accurately predicting future energy consumption becomes increasingly possible. Accurate predictions provide two-fold benefits: first, managers gain key insights into factors affecting their building’s energy demand, providing opportunities to address them and improve energy efficiency. Secondly, forecasts provide a benchmark to single out anomalously high/low energy consumption and alert managers to faults within the building.

However, the difficulty lies in the non-linearity and volatility of real-time energy usage, which is highly susceptible to changes in external factors. For instance, ambient temperature is known to significantly influence a building's energy demand via heating and air-conditioning [1]. Furthermore, there can be unexpected surges and drops in energy consumption due to equipment failure, supply failure, or simply random fluctuations that are difficult to explain.

Objective and metrics of success

Our task was to predict a building’s energy consumption 1 day ahead of time based on 2-year historical energy demand data provided in 15-minute intervals from July 2014 to May 2016. In addition, we were given temperature data from 4 locations of varying (undisclosed) distances from the building, in the order wx1 (nearest), wx2, wx3 and wx4 (farthest). We used a simple Artificial Neural Network (ANN, aka Multi-Layer Perceptron) as it is capable of capturing complex, non-linear relationships between diverse numerical data, and relatively fast to build and train, compared to more sophisticated architectures like Long Short-Term Memory networks (LSTMs).

We used two metrics to evaluate our model: Mean Squared Error (MSE) and Lag. Mean Squared Error measures the average of the squared errors between predictions and observations:

MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

where yᵢ is the observed energy consumption and ŷᵢ is the model's prediction.

Accordingly, we used MSE loss to optimise our ANN. The strength of MSE is that it punishes larger errors due to the squaring, reducing our model's likelihood of making extreme predictions, which would be costly or even dangerous. Minimally, our model must achieve a lower MSE than persistence, a trivial benchmark forecast where the predicted value 1 day ahead equals the observed present value. Persistence is a good starting benchmark because of the highly periodic nature of energy consumption [2].
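To make the persistence baseline concrete, here is a minimal sketch, assuming `energy` is a pandas Series of consumption at 15-minute intervals (so 96 timesteps = 24 hours); the names are illustrative:

```python
import pandas as pd

def persistence_mse(energy: pd.Series, horizon: int = 96) -> float:
    # Persistence: the forecast for time t is simply the observation
    # at t - horizon, i.e. "predicted value 1 day ahead = present value".
    forecast = energy.shift(horizon)
    errors = (energy - forecast).dropna()
    return (errors ** 2).mean()
```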

As for lag, our goal is a peak lag of 0 between our predictions and actual energy consumption values, where our model, on average, would not be delayed in its predictions and can capture changes in energy consumption on time.

Our workflow to this problem involved:

  1. Extensive literature review, data visualisation and analysis
  2. Pre-processing, data cleaning and feature engineering in Python
  3. Exporting the data into AutoCaffe to train and evaluate the neural network

A video summary of our approach is available at: https://www.youtube.com/watch?v=dTUU9urBUoE

Exploratory Data Analysis (EDA)

As the dataset given is anonymised with minimal context, we first scrutinised it to build a comprehensive intuition for effective feature engineering. Arguably, this is the most important step of any machine learning project, and we spent close to an entire week (out of ~2 weeks) on it, as firm believers in 'garbage in, garbage out'. We discovered that both the energy and temperature data contain a non-trivial amount of missing values, necessitating an effective method of filling them. Further, wx4 has very sparse data (only from 2016), so we were unlikely to be able to make much use of it.

Firstly, we plotted the energy data for 2015, the year with the most complete data (unlike 2014 and 2016). Mean monthly values were superimposed to offer a clearer overview of trends across months.

Figure 1: Time series of energy consumption (red) and temperature (blue) across 2015. Wx3 was used because it was shown to have the highest correlation with the energy consumption data.

As seen in the graph, temperature around the building ranges from below 0 °C to about 30 °C; given that the cold months are December to February and the warm months June to August, the building is likely in the Northern Hemisphere at latitude >30°. Interestingly, two local maxima of energy consumption exist, occurring at the two tail ends of temperature: once during the coldest months, and again during the hottest month (July), suggesting that air-conditioning and heating are significant drivers of energy demand. Across the year, we identified 3 different energy-temperature regimes:

Winter, December to February: Frequent and large energy fluctuations, with relatively large mean consumption. Temperature is generally below 10 °C.

Summer, June to August: Frequent but smaller energy variations compared to winter, steadily increasing with temperature. Temperature is generally above 20 °C.

Transition: March to May & September to November: Relatively constant and stable energy consumption. Temperature ranges from 10 to 20 °C.

This analysis inspired two dummy variables (1 for True, 0 for False): 1) is_season_winter and 2) is_season_transition, to facilitate better learning by the neural network. Note that an is_season_summer column would be 1 (True) exactly when both is_season_winter and is_season_transition are 0 (False); we therefore dropped the is_season_summer column to avoid the Dummy Variable Trap [3], where one variable can be straightforwardly inferred from one or more others, leading to multicollinearity issues.
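As an illustrative sketch (assuming a DataFrame with a DatetimeIndex; the season boundaries follow our EDA above), the dummy variables could be built along these lines:

```python
import pandas as pd

# Hypothetical 15-minute index, for illustration only
df = pd.DataFrame(index=pd.date_range("2015-01-01", "2015-12-31", freq="15min"))
month = df.index.month
df["is_season_winter"] = month.isin([12, 1, 2]).astype(int)
df["is_season_transition"] = month.isin([3, 4, 5, 9, 10, 11]).astype(int)
# is_season_summer is deliberately omitted: it is implied whenever both
# columns above are 0, so keeping it would cause multicollinearity.
```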

Moving on, we plotted the time series of energy consumption over the entire time frame available.

Figure 2: Yearly trends in energy consumption. Anomalously low energy consumption values from July 2014 to October 2014 stand out

We realised that energy consumption for July-October 2014 was anomalously low. There could be a variety of reasons: the building could have been newly built and slowly ramping up operations (hence not at full load), or undergoing maintenance. While discarding data is normally discouraged, we decided to do so here, as clearly anomalous data would hurt more than help a model that relies on historical data to make predictions. Thus, our energy data begins from 29 October 2014. With a train/test split of 70/30, our test data begins on 7 December 2015, and the training set covers a full year of data, which sufficiently exposes our model to all seasons across a year.

We then visualised energy consumption across different days of the week, calculating the mean, max and min consumption for each day of the week across the year, excluding public holidays.

Figure 3: Mean, maximum and minimum energy consumption values for each day of the week for 2015. Notice how energy consumption is lower over the weekends.

We observed that energy consumption was significantly lower during the weekends, implying the building is likely an office building — busy on weekdays, empty on weekends, rather than a shopping mall or a library. To exploit this pattern, we created a dummy variable on whether the day being predicted for was a weekend, called is_weekend.

Next, we plotted the distribution of energy consumption for each month, categorised into weekdays, weekends and public holidays.

Figure 4: Distribution of energy consumption based on type of day, across the different months. This provided us with ideas for feature engineering and customized method for filling missing data.

Firstly, energy consumption on weekdays was clearly higher than on weekends and public holidays in general. Secondly, while there are significant counts of anomalously high energy demand on weekends, the general distributions of energy consumption for weekends and public holidays are very similar. This suggested we should fill a weekday public holiday with energy consumption from the previous weekend/public holiday (whichever is nearer).

Next, we scrutinized daily consumption patterns.

Figure 5: Daily energy consumption for the month of July.

Generally, on weekdays, energy consumption picks up sharply at 7 am and drops off sharply after 6 pm, most likely the standard working hours of that building. Note that some of the plots look strangely shaped because of missing values, which further illustrates the need to fill these gaps. Zooming into a particular day, we found that on average, there is a noticeable drop in energy consumption around 12 pm, which we attribute to office lunchtime hours.

Figure 6: A closer analysis of daily trends suggests a local minimum of energy consumption at 12 pm, which we attribute to lunch hour.

As such, we decided to introduce the dummy variables is_lunchtime (hour = 12 on weekdays that are not public holidays) and is_working_hours (between 7 am and 6 pm on weekdays that are not public holidays).

Moving on, we plotted an autocorrelation plot of energy consumption to identify cyclical patterns backed by statistical analysis rather than ‘eye-balling’.

Figure 7: Autocorrelation plot of energy consumption for 672 timesteps (1 week). Notice the peaks occur at intervals of 96 (24 hrs).
Figure 8: Autocorrelation plot of energy consumption for 2880 timesteps (1 month). Notice that 672 timesteps has an even stronger autocorrelation coefficient than 96 timesteps.

As the data was given in 15-minute intervals, 24 hours corresponds to 96 timesteps, 12 hours to 48 timesteps, and so on. Energy consumption at a particular hour was most strongly correlated with the same hour of the day before. This relationship weakens as the number of days increases, but peaks again at 672 timesteps (1 week apart), which in fact shows stronger correlation than 1 day apart. On the other hand, autocorrelation was weakest 12 hours apart. This hinted that strong predictive features may include T:-576 (6 days ago from the current time, but 1 week ago from the time being predicted for), T:0 (1 day ago from the time being predicted) and T:-96 (2 days ago from the time being predicted).
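For reference, a minimal sketch of such an autocorrelation plot, assuming `energy` is the cleaned pandas Series at 15-minute resolution:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# 672 lags = 1 week of 15-minute timesteps; peaks are expected at
# multiples of 96 (1 day), with the strongest at 672 (1 week).
plot_acf(energy.dropna(), lags=672)
plt.show()
```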

Next, we delved into the relationship between energy demand and temperature, and attempted to fit a polynomial trend, inspired by Valor et al 2001 [1].

Figure 9: The relationship between energy consumption and temperature is non-linear, as seen by the scatterplot.

While the best fit lines are clearly not ideal, the scatterplots still revealed useful insights. At the tail ends of temperature (too hot or too cold), energy consumption tends to rise, most likely due to increased air conditioning or heating respectively. Moreover, the relationship is unlikely to be purely linear, showing hints of a quadratic one with a ‘most comfortable’ temperature at about 19 °C. To explore this further, we crafted a correlation heatmap using the Python Seaborn library, dividing the data into winter, summer and transition months.

Figure 10: A correlation heatmap of temperature and square of temperature with energy consumption

Firstly, we observed that wx3 has noticeably higher absolute correlation value across all periods (0.29 vs 0.24), winter (0.051 vs 0.0079 & 0.016) and transition months (0.17 vs 0.11). In summer, it was slightly lower (0.43 vs 0.46) than wx1 and wx2. This was also confirmed by preliminary investigations with feature importance values in XGBoost, which consistently ranked wx3 at T+96 higher than that of wx1 or wx2. Thus, we focused on creating windowed features for temperature mostly off wx3.

Next, a quadratic relationship seemed to slightly outperform the linear one, with higher absolute correlation values in summer (for all 3 temperatures) and winter (for wx3). Therefore, on top of the raw value, the squared value of wx3 at T+96 (the time being predicted) might be a useful feature to consider.

Lastly, in winter, both raw and squared temperature have very poor correlation with energy. One reason might be that in winter, the building's heating is maintained at a fixed level regardless of large fluctuations in outdoor temperature.
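A sketch of how such a seasonal correlation heatmap might be produced with Seaborn, assuming `df` holds aligned energy, temperature and squared-temperature columns (the column names are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

seasons = {"winter": [12, 1, 2], "summer": [6, 7, 8],
           "transition": [3, 4, 5, 9, 10, 11]}
cols = ["wx1", "wx2", "wx3", "wx1_sq", "wx2_sq", "wx3_sq"]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, months) in zip(axes, seasons.items()):
    season = df[df.index.month.isin(months)]
    # Correlation of each (squared) temperature column with energy
    corr = season[cols + ["energy"]].corr()[["energy"]].loc[cols]
    sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax)
    ax.set_title(name)
plt.tight_layout()
```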

Data Cleaning

We conducted data pre-processing in Python instead of AutoCaffe because our team is more proficient in Python libraries than Smojo, and we wanted more control to create specific features like dummy variables. A general outline of the pre-processing pipeline involved aligning temperature data to 15-minute intervals, interpolation and normalisation.

Approximately 17% of the energy data is missing between 2014-10-29 00:00:00 and 2016-05-26 20:15:00. We first tried linear interpolation, but realised this leads to data leakage, as the interpolated value uses data from the future (after the missing timestamp). We also noted that missing values usually span whole periods of up to two days, where linear interpolation fails to capture the inherent seasonality in energy demand driven by factors like temperature and working hours. We then tried simple forward filling, where missing data is replaced by the data exactly 24 hours before. However, this did not reflect the weekly trend well, because of the difference between weekdays and weekends/public holidays.

Therefore, we implemented a customised filling method that considers the type of day (public holiday/weekend/weekday). If the day with the missing value is not a public holiday, the missing value is replaced with the value exactly one week before (provided that value is not missing too, which luckily never happens in this dataset). Otherwise, if it is a public holiday, it is replaced with the nearest previous weekend value.
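A simplified sketch of this rule, assuming `energy` is the 15-minute Series and `holidays` is a set of public holiday dates (the real pipeline handled edge cases more carefully):

```python
import pandas as pd

def fill_missing(energy: pd.Series, holidays: set) -> pd.Series:
    filled = energy.copy()
    for ts in filled[filled.isna()].index:
        if ts.date() in holidays:
            # Public holiday: borrow the same time-of-day from the
            # nearest weekend day before this timestamp.
            donor = ts - pd.Timedelta(days=1)
            while donor.weekday() < 5:  # 5, 6 = Saturday, Sunday
                donor -= pd.Timedelta(days=1)
        else:
            # Ordinary day: borrow the value exactly one week earlier.
            donor = ts - pd.Timedelta(weeks=1)
        if donor in filled.index:
            filled.loc[ts] = filled.loc[donor]
    return filled
```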

Figure 11: Top: comparing linear interpolation vs our own customized method of filling the data. Bottom: a zoomed in example of filled energy data.

This method more accurately reflects the seasonality and is relatively easy to implement as opposed to more sophisticated methods. We did consider two improvements:

  1. Instead of exactly copying, add a random 'jitter' to the values being brought forward by multiplying all the energy values from that day by a small, random factor between, say, 0.8 and 1.2. This reduces the chance of our model overfitting to the historical data, as long as a suitable range for the random factor is chosen. The jitter would only be applied to the training dataset, not the test set. However, as we did not face a severe overfitting problem, we did not implement this idea.
  2. Fit a neural network or a time-series forecasting algorithm that also considers temperature data to impute the missing values, as it might give even more realistic results. However, we decided this overcomplicates the task given the time constraints of the competition.

As for temperature, we had data for the next 24 hours from weather forecasts. We chose linear interpolation because the temperature gaps last less than an hour, so interpolation could still capture the trend well (Figure 12). Secondly, as we are given future data, data leakage is not an issue. Interpolation was not done for wx4 as it was too sparse.

Figure 12: Top: How the temperature data looks before linear interpolation. Bottom: After linear interpolation. For temperature, as data gaps were sparse, we could use linear interpolation to produce reasonable looking results

We normalised all energy and temperature data using MinMax scaling into the range [0, 1]. Such standardisation of all feature inputs is critical for neural networks, ensuring that any differences in feature importance are solely due to the feature itself and not its numerical magnitude. We also took care to take the min/max values from the training data only, to prevent data leakage from the test set.
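In code, this is a standard scikit-learn pattern; a minimal sketch, assuming `train_df` and `test_df` hold the numeric feature columns:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_df)  # min/max learnt here only
test_scaled = scaler.transform(test_df)        # reuses training min/max
```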

Converting 1d time to 2d time

During our literature review, we discovered a creative manipulation of the time data: Moon et al 2019 [2] transformed calendar time into a 2d continuous format. While calendar data like month and day have periodic properties, representing them as sequential numbers loses some of that periodicity. For example, 0000 hrs follows right after 2359 hrs, but numerically they are very far apart. Thus, Moon et al [2] utilised the following equations (EoM = end of month, i.e. the number of days in that month):

Figure 13: Top: equations for converting 1d time to 2d time. Left: 1d sequential representation of time. Notice even though December (month 12) comes after January (month 1), the numerical gap of 11 does not reflect this periodicity. Right: 2d transformation of time. Notice how it restores the periodicity of months with December rightly a neighbour of January.

Our test loss improved by roughly 2.5 to 4% when we used 2d time.
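A minimal sketch of the transform, assuming the standard sine/cosine encoding on a circle (our reading of Moon et al [2]) and a DataFrame `df` with a DatetimeIndex:

```python
import numpy as np

idx = df.index
df["monthx"] = np.cos(2 * np.pi * (idx.month - 1) / 12)
df["monthy"] = np.sin(2 * np.pi * (idx.month - 1) / 12)
eom = idx.days_in_month  # EoM: number of days in that month
df["dayx"] = np.cos(2 * np.pi * (idx.day - 1) / eom)
df["dayy"] = np.sin(2 * np.pi * (idx.day - 1) / eom)
df["hourx"] = np.cos(2 * np.pi * idx.hour / 24)
df["houry"] = np.sin(2 * np.pi * idx.hour / 24)
```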

Another interesting feature we wanted to explore was the 'public holiday/weekend inertia' proposed by Valor et al 2001 [1]. They found that energy consumption in office buildings was systematically lower on working days following weekends (i.e. Mondays) or public holidays, because of inertia from the reduced economic activity on the non-working day. Features to exploit this could be 'days since last public holiday' and 'days since last weekend'. However, careful analysis of our data suggested that this inertia effect was not present, so we did not pursue it further.

Research methodology

Our approach to generating the best possible model involved training the model in AutoCaffe and adding features one by one based on the test score and lag achieved. While AutoCaffe allowed for fast training, a limitation was that we could not automate the permutation of features and had to do it manually. To minimise time spent tediously permuting features, we relied on data analysis, domain knowledge, extensive feature engineering and XGBoost SHAP feature importance values to cut down the feature combinations to check.

One big assumption here is that features are independent of one another, with minimal feature interaction. Since this assumption does not always hold, we also ran certain combinations of features together, based on intuition and domain knowledge gained from reading the scientific literature. We were also careful to conduct sufficient repeats to reduce the variance in final test losses due to random weight initialisation.

Windowing

For preliminary investigations, we prepared a Pandas dataframe containing the raw values of energy, temperature (wx1 to wx3), datetime features (like month, day) and windowed features (like min, max, mean, range, first-order differences, mean of first-order differences, second order differences and so on).

Good windowing is crucial to help our ANN cluster the data better. Our choice of windows was guided by our understanding of the cyclical pattern of energy consumption:

  1. Small windows of 1 hour (e.g. T:0:-4 mean) to capture recent fluctuations in energy
  2. Slightly larger windows of 5 hours to capture larger changes throughout the day
  3. Larger windows of 12 hours to capture cyclic day & night patterns
  4. Largest windows of 1 day to 1 week to capture seasonal transitions

We did not use windows larger than 1 week because, from the autocorrelation plot, we felt that values too distant in the past may only add noise, not to mention adding extra dimensions to our input.
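A sketch of the windowed-feature construction in Pandas, assuming `df` has an `energy` column at 15-minute resolution (feature names are illustrative; in our T-notation, T:0 is the present and the target is T+96):

```python
e = df["energy"]
df["E_T-1"] = e.shift(1)                   # energy 15 minutes ago
df["Emean_0to-4"] = e.rolling(5).mean()    # 1-hour window mean
df["Ediff_0-20"] = e - e.shift(20)         # 5-hour first difference
df["Emin_0to-48"] = e.rolling(49).min()    # 12-hour window min
df["Emean_0to-96"] = e.rolling(97).mean()  # 1-day window mean
df["Emax_0to-576"] = e.rolling(577).max()  # 1-week window max
df["target_diff"] = e.shift(-96) - e       # difference to be predicted
```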

With the windowed features of energy and wx3 temperature, dummy variables and 2d time features, we conducted preliminary experiments on AutoCaffe to eliminate unhelpful features, which we found to be ‘range’, ‘skew’ and ‘kurtosis’. We had hoped that ‘skew’ and ‘kurtosis’ could signal to our model the recent presence of extreme values (e.g. a short, sudden heatwave with higher than normal temperatures against a background of normal temperatures) that might increase its robustness in anticipating unexpected events. Unfortunately, these features did not improve our test loss and lag despite repeated experiments. Regarding wx4, we did try our best to utilise it, such as by having a ‘previous month’s average temperature’ calculated across all 4 sensors, but such features unfortunately did not improve our results.

After ~150 experiments, we generated a refined list of ~130 features.

Feature Selection: XGBoost SHAP feature importance values

At this stage, to provide rigorous justification for our feature selection process, we tapped the Python XGBoost library, a fast and user-friendly implementation of the gradient-boosted decision trees algorithm. We chose decision trees as they are better at handling high-dimensional datasets (>100 feature columns) than neural networks, which are more prone to drawing poor decision boundaries due to the curse of dimensionality and unimportant inputs.

We fed the ~130 features into an XGBoost regressor model to predict the difference between the T:0 and T+96 energy values (mimicking the 'difference' neural network in AutoCaffe). Interestingly, with fairly standard hyperparameters (sketched below), this model achieved a test MSE of 0.010853 (after the factor of 0.5), which already beats the persistence benchmark of 0.019377 by 44%, although we did not visualise its lag correlation or scatterplot.
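A sketch of this setup, with illustrative (not our exact) hyperparameters; `X_*` holds the ~130 candidate features and `y` is the E(T+96) minus E(T:0) difference:

```python
import xgboost as xgb
from sklearn.metrics import mean_squared_error

model = xgb.XGBRegressor(n_estimators=500, max_depth=6,
                         learning_rate=0.05, subsample=0.8)
model.fit(X_train, y_train)
pred = model.predict(X_test)
# AutoCaffe reports 0.5 * MSE, so we apply the same factor for comparison
test_loss = 0.5 * mean_squared_error(y_test, pred)
```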

Using the Python SHAP library [4, 5], we could easily visualise the contribution of various features to the XGBoost model's outputs. SHAP was chosen for its consistency and accuracy across models, which many other feature importance schemes lack [6], including XGBoost's and scikit-learn's built-in versions. We ranked the top features out of the ~130 by SHAP value (higher = more important), and focused on permuting these top features during an additional round of experimentation on AutoCaffe. These top features are very likely to facilitate better clustering of the data, which should transfer to neural networks. Of course, we understood that inherent differences exist between the algorithms of gradient-boosted trees and neural networks; the SHAP feature importance values are not the be-all and end-all, and we did include other features occasionally. A minimal sketch of the SHAP step follows.
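```python
import shap

# TreeExplainer implements the consistent TreeSHAP algorithm of [4]
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)  # beeswarm ranking, as in Figure 14
```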

Figure 14: Top features ranked by SHAP values (top feature = most important) for ~130 features fed into a standard XGBoost Regressor. Note that for wx3, T:0 actually refers to the time being predicted for (24 hours ahead from T:0 for energy) as we have access to temperature forecast data. Please read on for a brief explanation of our syntax.

To clarify our syntax, Emean_diff96-96:-195 is the average (to reduce noise) of the following four differences: 1) E:-96 minus E:-192, 2) E:-97 minus E:-193, … 4) E:-99 minus E:-195. Similarly, Emean_diff48-0:-51 is the average of: 1) E:0 minus E:-48, 2) E:-1 minus E:-49, … 4) E:-3 minus E:-51.

For 1week_meandiffdiff, we first calculate seven 1st-order differentials: Emean_diff96-0:-99, Emean_diff96-96:-195, … Emean_diff96-576:-675. We then calculate the successive differences between these values (to get six 2nd-order differentials) and take their average to get the single value 1week_meandiffdiff. We could equally have used the six 2nd-order differentials without averaging, but we wanted to minimise the number of features and did not want to introduce too many 'unimportant inputs'.
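To make this syntax concrete, here is a hedged sketch of how these features can be computed, assuming `e` is the energy Series and E:-n corresponds to e.shift(n):

```python
import pandas as pd

def emean_diff(e: pd.Series, recent: int, window: int = 4, lag: int = 96):
    # Average of E:-(recent+k) minus E:-(recent+lag+k) for k = 0..window-1,
    # e.g. recent=96 reproduces Emean_diff96-96:-195.
    diffs = [e.shift(recent + k) - e.shift(recent + lag + k)
             for k in range(window)]
    return sum(diffs) / window

# 1week_meandiffdiff: average of the six 2nd-order differences between
# the seven weekly 1st-order differentials described above.
weekly = [emean_diff(e, d * 96) for d in range(7)]
second_order = [weekly[i] - weekly[i + 1] for i in range(6)]
oneweek_meandiffdiff = sum(second_order) / 6
```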

Results and Discussion: Features

Our best neural network achieved a test loss of 0.00907207 (with 50 repeats) and test lag of 0, beating our persistence of 0.01937720 by 53%. We also calculated a more realistic persistence benchmark, where predictions for the next Monday/Saturday are from the previous Monday/Saturday instead of 24 hours ago, as the energy demand on weekends and weekdays differ significantly. Still, our test loss beats this ‘smart’ persistence of 0.0163 by 44%.

The 36 features we used for our difference network were (many of them from Figure 14):

  1. Time features
  • 2d time: hourx, houry, dayx, dayy, monthx, monthy
  • Day of the week (0 = Monday, 6 = Sunday, scaled by MinMax to [0,1])

  2. Dummy variables (inspired by our data analysis):

  • is_pubhol, is_weekday, is_working_hour, is_season_transition, is_season_winter, is_lunchtime
  • Note that we dropped is_weekend & is_season_summer to avoid multi-collinearity problems where, across the same row, the values of one column in our data can be inferred from the values of other columns

  3. Energy features:

  • Energy at T:-1, First-order difference between T:0 & T:-1, Min & Mean of T:0:-4, Mean of first-order difference from T:0:-4 (1 hour window)
  • First-order difference between T:0 and T:-20 (5 hour window)
  • Min of T:0:-48, First-order difference between T:0 & T:-48 (12 hour window)
  • Mean of T:0:-96 (1 day window)
  • First-order difference between T:-96 & T:-192 (1 day window)
  • Energy at T:-576, Max of T:0:-576 (1 week window from T+96)

  4. Temperature features:

  • Wx3 at T+96, Max & Min of T+96:+92, first-order difference between T+96 & T+95 (1 hour window)
  • Min, Max & Mean of T+96:0 (1 day window)
  • Mean & Max of T+96:-576 (1 week window)

  5. Nonlinear features:

  • Square of wx3 at T+96
  • Product of energy at T:0 & wx3 at T+96

After seeing improvements from using the square of wx3 at T+96, we experimented with a variety of other nonlinear features: the cube of wx3 at T+96, ln(wx3 at T+96 in Kelvin), the square root of wx3 at T+96, etc. The idea for the product of energy at T:0 and wx3 at T+96 was a natural extension from thinking about the quadratic expansion (a+b)² = a² + b² + 2ab.

Results and Discussion: Model

Our best model (a rough Keras sketch follows this list):

  • A simple difference network that predicts the difference in energy consumption between T+96 and T:0, with the model’s prediction added to the energy at T:0
  • 4 layers with 32 perceptrons in the first layer with layer-by-layer shrinking ratio of 2/3 (i.e. 32, 21, 14, 9 perceptrons)
  • Dropout probability of 0, tanh activation, 10,000 iterations with early stopping & Adam optimiser
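Below is a rough equivalent of this architecture in Keras; it is a sketch under stated assumptions (36 input features, MSE loss, test loss used as the validation proxy, as discussed in the Conclusion), not the exact AutoCaffe configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(36,)),
    layers.Dense(32, activation="tanh"),
    layers.Dense(21, activation="tanh"),
    layers.Dense(14, activation="tanh"),
    layers.Dense(9, activation="tanh"),
    layers.Dense(1),  # predicts E(T+96) - E(T:0); add E(T:0) back afterwards
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_test, y_test),
          callbacks=[keras.callbacks.EarlyStopping(
              patience=10, restore_best_weights=True)])
```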

For most experiments, the number of layers was kept at 3 or 4, screening first-layer perceptron counts of 32, 64, 128 and occasionally 256, with ReLU activation for fast training. We quickly found that first-layer counts of 32 to 64 gave the best results, and that tanh activation enabled superior test loss, albeit at the cost of slower training. The intuition behind why tanh outperforms ReLU here may be its smooth, saturating non-linearity, which could be crucial for mapping complex relationships in the energy/temperature data. tanh also avoids issues faced by ReLU such as the 'dying ReLU' problem, where a neuron that consistently receives negative pre-activations outputs zero, receives zero gradient, and is unlikely to ever recover.

2 layers were insufficient for optimal learning, while 5 or 6 layers quickly overfitted. Control experiments with the SGD optimiser gave dismal results. Trials with square perceptrons, autoencoders/scaling and force/momentum losses did not improve our results. Still, we think there remains an opportunity to harness autoencoders on more columns of windowed energy/wx3 values to reduce dimensionality whilst extracting the most useful information from historical data.

Figure 15: Graphs of training & test predictions, scatterplots and lagged correlation values

Our model’s strengths & weaknesses

Strengths:

  • Comfortable gap between the main lag peak at T=0 and the secondary lag peak at T=+96. Zero lag means no time delay in our forecasting, providing building managers with a reliable, on-time benchmark against which to compare their building's actual energy consumption, where any jarring differences may indicate, for example, otherwise unnoticed faults in the building's heating/cooling systems.
  • Fairly good predictions during the early winter and the spring months. Our model is able to predict the recurring weekly trends well, anticipating the rise during weekdays/daytime and dips during weekends/night-time.
  • Overall, minimal jarringly anomalous predictions throughout the test period.

Weaknesses:

  • During the late winter-early spring transition period, our model struggles to predict extreme energy consumption values, as shown in the green boxes on the test prediction graph and reflected by the two green ovals in the test scatterplot. That being said, there are roughly 16,000 points in the test dataset, and most lie close to the ideal 1:1 relationship between prediction and actual; the anomalies constitute a relatively small percentage of total points.
  • Unfortunately, despite our best attempts, we were unable to further lower our test loss and overcome this issue. This period is characterised by large fluctuations in temperature and energy consumption, which may be contributing to our model’s difficulties.

Below are zoomed-in graphs of predictions vs actuals, which again highlight the strength of our model in closely predicting the seasonal patterns in early winter and the spring months (green boxes), and its weakness in failing to predict some extreme values during February to March 2016 (red boxes). We note that in Dec-Jan and April-May particularly, the green boxes show a good fit between the actual and predicted curves.

Figure 16a: Green box: good predictions, orange box: our model predicts public holidays accurately
Figure 16b: Green box: good predictions, orange box: our model predicts public holidays accurately
Figure 16c: Red box: where our model did not correctly predict peaks & dips

After scrutinising these graphs, we realised that our model may not always be at fault. It may well be possible that the peaks/dips our model failed to predict were genuine outliers or unexpected energy consumption behaviour. This makes some sense if we re-examine the red boxes in Figure 16c: the first red box shows sustained high energy consumption over a weekend, which is unusual. Furthermore, the second peak just after the second box falls on a Saturday, yet another weekend. Perhaps certain overtime weekend operations were running in those late winter months that are not captured in our training data.

That being said, we recognise that the second red box spans Wednesday to Friday (weekdays), where our model did fail to match the actual peaks and underpredicted. This warrants further analysis; perhaps the correlation between wx3 temperature and actual energy deviated during this period. From our research, weather factors other than temperature, such as wind speed and humidity, affect the 'feels-like' temperature and thus the use of heating/cooling. Having more data might help us judge whether the actual trends were forecastable or genuine outliers.

A note on interpretable machine learning

The SHAP feature values are an extremely rich source of information about the multiple relationships in our dataset. In contrast, it is much harder to delve into the 'black box' of neural networks to understand how weights and biases tell a story about different features. After all, it would be more beneficial for the energy forecasting community if we could also gain insight into how certain features shape the model's output, rather than blindly hunting for the lowest test loss with a resultant model that may not transfer to different contexts. Therefore, we made it a point to visualise the SHAP feature importance graphs again, on a 'difference' XGBoost regressor fitted on just our best 36 features.

Figure 17: Top features ranked by SHAP values for the best 36 features fed into a standard XGBoost Regressor. Note that for wx3 specifically, T:0 is the temperature at the time being predicted for (24 hours ahead of T:0 for energy) as we have access to temperature forecast data.

For working hours, it is no surprise that a clear separation exists, where a high value (i.e. =1) at the time being predicted for tends to increase energy consumption, while a low value (i.e. =0) decreases it. Still, the presence of a range of predictions hints that is_working_hours is interacting with other features. Similarly, for dayofweek (Monday = 0 to Sunday = 6 minmax scaled to [0,1]), high values (Saturday & Sunday) tend to drive down the model’s output, while lower values increase it.

More interestingly, the graph suggests that high values of energy consumption 1 day and 15 minutes ago (ET:-1) are more likely to result in decreased energy consumption right now, with the converse being true too (although the magnitude is much lower for the converse). Additionally, for high values of ET:-1, we see hints of feature interaction from the variance in model output ranging from -0.25 to ~0. Also, high maximum energy consumption over the past week from T+96 (Emax_0to576) tends to slightly increase the model’s output, while low values have minimal effect. As for temperature, the graph implies that low values of moving weekly average temperature (wx3mean_-0to-672) can both increase and decrease the model’s output, while high values appear to have no effect. The reasons for these are not immediately clear, and follow-up studies can be conducted to examine them further. That being said, we must note that while SHAP values can indicate high correlation between features & model output, they do not imply causation.

Conclusion

In conclusion, we have built a relatively accurate neural network model to predict energy consumption for a building 1 day ahead, trained on about a year of historical energy data as well as temperature forecasts up to a day ahead.

The chief strength of ANNs is the sheer performance bump they offer over conventional machine learning methods. The trade-off is that large amounts of data are required, and the computational costs can become heavy in both time and money. The inner workings of how an ANN learns also remain a 'black box', and we had to use extra packages like SHAP to indirectly explore relative feature importance.

It must be noted that the results of this project are really a 'validation' loss, since we used the test loss values to change our feature combinations and improve our model. Given the limited data, we did not have the luxury of a separate validation set and relied on the test loss as a proxy. This may have caused subtle overfitting to the test set, given that we conducted a few hundred experiments. Thus, it would be ideal to evaluate our model again on an unseen energy dataset for the same building for a more unbiased estimate of its predictive power.

While we hope that our findings are applicable to other contexts, the type of building and its climate should always be considered. The same trends may not apply for a residential building or a shopping mall, or an office building located in an equatorial climate like Singapore’s, which lacks distinct seasons. We should also be mindful of climate change, which may lead to temperatures (and patterns of temperature change) rarely seen historically, and may pose a problem for models that heavily rely on historical data.

Overall, we have truly learnt a tonne from this end-to-end experience, from exercising our object-oriented Python programming skills in building the pre-processing, feature engineering and windowing pipelines, sharpening our data and statistical intuition with extensive visualisations and literature reviews, to understanding the caveats behind different machine learning approaches and making cautious decisions based on algorithms’ results. We brainstormed so many ideas during the competition (including having separate models for winter, transition season and summer) but had only so much time (and data) to try them all.

To end off, we thank ai4impact and NTU CAO for organising such a fun and valuable opportunity!

Connect with us on LinkedIn!

References:

1. Valor, E., Meneu, V., & Caselles, V. Daily air temperature and electricity load in Spain (2001). Journal of Applied Meteorology, 40(8), 1413–1421. https://doi.org/10.1175/1520-0450(2001)040<1413:DATAEL>2.0.CO;2

2. Moon, J., Park, S., Rho, S., & Hwang, E. A comparative analysis of artificial neural network architectures for building energy consumption forecasting (2019). International Journal of Distributed Sensor Networks, 15(9), 1550147719877616. https://doi.org/10.1177/1550147719877616

3. Saurav, A. Dummy Variable Trap (2019). https://medium.com/datadriveninvestor/dummy-variable-trap-c6d4a387f10a

4. Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., … & Lee, S. I. From local explanations to global understanding with explainable AI for trees (2020). Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9

5. Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., … & Lee, S. I. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery (2018). Nature Biomedical Engineering, 2(10), 749–760. https://doi.org/10.1038/s41551-018-0304-0

6. Amjad, A. The Multiple faces of ‘Feature importance’ in XGBoost (2019). https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7
