DengAI: Predicting Disease Spread - Imputation and Stationarity Problems
One of the biggest data challenges on DrivenData, with more than 9,000 participants, is the DengAI challenge. The objective of this challenge is to predict the number of dengue fever cases in two different cities.
This blog post series covers our journey of tackling this problem, starting from the initial data analysis, imputation and stationarity problems up to the different forecasting attempts. This first post covers the imputation and stationarity checks for both cities in the challenge, before later posts move on to the different forecasting methodologies.
Throughout this post, code snippets are shown in order to give an understanding of how the concepts discussed are implemented in code. The entire GitHub repository for the imputation and stationarity adjustment can be found here.
Furthermore, in order to ensure readability, we decided to show graphs only for the city of San Juan instead of showing them for both cities.
Imputation
Imputation describes the process of filling in missing values within a dataset. Given the wide range of possible imputation methods and the substantial amount of missing data within this project, it is worthwhile to go over some of the methods and empirically check which one to use.
Overall, we divide all imputation methods into two categories: basic and advanced. By basic methods we mean quick, off-the-shelf imputation methods, which are oftentimes already built into Pandas. Advanced imputation methods are model-based approaches, in which the missing values of a column are predicted using the remaining columns.
Given that model-based imputation methods normally result in superior performance, the question might arise why we do not simply use the advanced methods for all columns. The reason is that our dataset has several observations where all features are missing. The presence of these observations makes multivariate imputation impossible for them.
We therefore divide the features into two categories. All features, or columns, which have fewer than 1 percent missing observations are imputed using more basic methods, whereas model-based approaches are used for features which exhibit more missing observations than this threshold.
The code snippet below counts the percentage of missing observations, divides all features into one of the two aforementioned categories and creates a graph to visualize the results.
def pct_missing_data(self, data, cols, threshold, city_name):
"""
This method does two things. First, it creates a chart showing the
percentages of missing data. Second, it returns which columns
have less than a certain threshold percentage of data, and which
columns have more.
Parameters
----------
data : DataFrame
DataFrame containing all the data
cols : list
List containing all the columns we have to fill
threshold : int
The threshold which divides the easy and difficult columns
city_name : str
A string containing the city name for which we calculate
Returns
-------
easy_columns : list
        List containing the columns which have less than the threshold
missing data
diff_columns : list
List containing the columns which have more than the threshold
missing data
"""
# Calculating the percentage missing
num_of_obs = len(data)
num_of_nans = data.loc[:, cols].isna().sum()
df_pct_missing = pd.DataFrame(num_of_nans
/ num_of_obs*100).reset_index()
df_pct_missing.rename(columns={"index": "columns",
0: "pct_missing"}, inplace=True)
df_sorted_values = df_pct_missing.sort_values(by="pct_missing",
ascending=False)
# Column division
bool_easy = df_sorted_values.loc[:, "pct_missing"] < threshold
easy_columns = df_sorted_values.loc[bool_easy, "columns"]
diff_columns = list(set(cols) - set(easy_columns))
# Plotting the data
fig, axs = plt.subplots(figsize=(20, 10))
axs.bar(df_sorted_values.loc[:, "columns"],
df_sorted_values.loc[:, "pct_missing"])
axs.tick_params(axis="both", labelsize=16, labelrotation=90)
axs.set_ylabel("Percentage of missing observations", fontsize=18)
axs.set_xlabel("Variables", fontsize=18)
axs.axhline(threshold, linestyle="dashed", color="red",
label="{} percent threshold".format(threshold))
axs.legend(prop={"size": 20})
fig.savefig(r"{}/{}_miss_data_pct.png".format(output_path, city_name),
bbox_inches="tight")
return easy_columns, diff_columns
The resulting graph below shows four features which have more than 1 percent of their observations missing. Especially the feature ndvi_ne, which describes satellite-measured vegetation to the north-east of the city, has a severe amount of missing data, with around 20% of all observations missing.
All applied imputation methods are compared using the normalized root mean squared error (NRMSE). We use this metric because it makes variables with different scales comparable. Given that the NRMSE is not directly implemented in Python, we use the following snippet to implement it.
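For reference, with y denoting the true values, ŷ the imputed values and n the number of imputed observations, the quantity computed below is

$$ \mathrm{NRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}}{y_{\max} - y_{\min}} $$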
def nrmse(self, y_true, y_pred, n):
"""
This function calculates the normalized root mean squared error.
Parameters
----------
y_true : array
The true values
y_pred : array
The predictions
n : int
        The number of rows we are testing for performance
Returns
-------
rounded_nrmse : float
The resulting, rounded nrmse
"""
ts_min, ts_max = np.min(y_true), np.max(y_true)
mse = sum((y_true-y_pred)**2) / n
nrmse_value = np.sqrt(mse) / (ts_max-ts_min)
rounded_nrmse = np.round(nrmse_value, 2)
return rounded_nrmse
Basic imputation methods
Python, and in particular the Pandas library, has multiple off-the-shelf imputation methods available. Arguably the most basic ones are forward fill (ffill) and backward fill (bfill), where we simply set the missing value equal to the preceding value (ffill) or to the succeeding value (bfill).
Other methods include linear or cubic interpolation around a missing observation (the SciPy package also offers higher-order interpolation if desired).
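As a minimal illustration of these off-the-shelf methods, consider the toy series below (made up purely for demonstration purposes):

import numpy as np
import pandas as pd

# Hypothetical toy series with a two-observation gap
ts = pd.Series([20.1, 20.4, np.nan, np.nan, 21.0, 21.3])

ts.ffill()                        # repeat the last observed value
ts.bfill()                        # take the next observed value
ts.interpolate(method="linear")   # straight line between the neighbours
ts.interpolate(method="cubic")    # cubic spline (requires SciPy)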
Lastly, we can use the average of the k nearest neighbours of a missing observation. For this problem we took the four preceding and the four succeeding observations of a missing value and imputed it with the average of these eight values. This is not a built-in method and is therefore defined by us in the following way:
def knn_mean(self, ts, n):
"""
This function calculates the mean value of the n/2 values before
and after it. This approach is therefore called the k nearest
neighbour approach.
Parameters
----------
ts : array
The time series we would like to impute
n : int
The number of time period before + after we would like
to take the mean of
Returns
-------
out : array
The filled up time series.
"""
out = np.copy(ts)
for i, val in enumerate(ts):
if np.isnan(val):
            n_by_2 = int(np.ceil(n/2))
            lower = np.max([0, i - n_by_2])
            upper = np.min([len(ts), i + n_by_2 + 1])
            # Average the n/2 observations before and after the gap,
            # ignoring neighbours that are themselves missing
            ts_near = np.concatenate([ts[lower:i], ts[i+1:upper]])
            out[i] = np.nanmean(ts_near)
return out
Now it is time to apply and compare these methods. We do that by randomly dropping 50 observations from each column and imputing them with all of the aforementioned methods. Afterwards we assess each method's performance through its NRMSE score. All of that, as well as the plotting of the results, is done through the following code snippet.
def imputation_table(self, data, cols, city_name):
"""
This method calculates the nrmse for all columns and inserts them
in a table. Additionally a graph is plotted in order for visual
inspection afterwards. The score is calculated by randomly dropping
50 values and then imputing them. Afterwards the performance is
assessed.
Parameters
----------
data : DataFrame
Dataframe which includes all columns
cols : list
List of columns we would like to impute
city_name : str
In order to know which city data was used, we specify the name
Returns
-------
nrmse_df : DataFrame
The results of each method for each column.
"""
nrmse_df = pd.DataFrame(index=cols)
print("Create imputation table")
for col in tqdm(cols):
original_series = data.loc[:, col]
time_series = original_series.dropna().reset_index(drop=True)
n = 50
random.seed(42)
rand_num = random.sample(range(0, len(time_series)), n)
time_series_w_nan = time_series.copy()
time_series_w_nan[rand_num] = np.nan
# Forward fill ----
ts_ffill = time_series_w_nan.ffill()
nrmse_df.loc[col, "ffill"] = self.nrmse(time_series, ts_ffill, n)
# Backward fill ----
ts_bfill = time_series_w_nan.bfill()
nrmse_df.loc[col, "bfill"] = self.nrmse(time_series, ts_bfill, n)
# Linear Interpolation ----
ts_linear = time_series_w_nan.interpolate(method="linear")
nrmse_df.loc[col, "linear"] = self.nrmse(time_series,
ts_linear, n)
# Cubic Interpolation ----
ts_cubic = time_series_w_nan.interpolate(method="cubic")
nrmse_df.loc[col, "cubic"] = self.nrmse(time_series, ts_cubic, n)
# Mean of k nearest neighbours ----
ts_knn = self.knn_mean(time_series_w_nan, 8)
nrmse_df.loc[col, "knn"] = self.nrmse(time_series, ts_knn, n)
# Plotting results
adj_df = nrmse_df.reset_index()
long_format = pd.melt(adj_df, id_vars=["index"], var_name=["nrmse"])
fig, axs = plt.subplots(figsize=(20, 10))
sns.barplot(x="index", y="value", hue="nrmse",
data=long_format, ax=axs)
axs.tick_params(axis="both", labelsize=16, labelrotation=90)
    axs.set_ylabel("Normalized Root Mean Squared Error", fontsize=18)
axs.set_xlabel("Variables", fontsize=18)
axs.legend(prop={"size": 20})
fig.savefig(r"{}/{}_imput_performance.png".format(output_path,
city_name),
bbox_inches="tight")
return nrmse_df
The resulting graph below clearly shows which method is to be favored, namely the k nearest neighbours approach. Linear interpolation also performs well, even though not as well as the knn method. The more naive methods such as ffill and bfill do not perform as strongly.
Afterwards, we impute all features which have fewer missing observations than our threshold of one percent, meaning all features except the first four. The code below selects the best method for each column and afterwards imputes all actually missing values.
def fill_by_method(self, original_series, method):
"""
After we know what the best method is for each column, we would
like to impute the missing values. This function lists all
    potential methods, except the model-based one.
Parameters
----------
original_series : array
The original array with all its missing values
method : str
A string describing the best working method
Returns
-------
time_series : array
        The original array with its missing values now filled using
        the method of choice
"""
if method == "ffill":
time_series = original_series.ffill()
elif method == "bfill":
time_series = original_series.bfill()
elif method == "linear":
time_series = original_series.interpolate(method="linear")
elif method == "cubic":
time_series = original_series.interpolate(method="cubic")
elif method == "knn":
time_series = self.knn_mean(original_series, 8)
return time_series
def fill_easy_columns(self, data, easy_columns, nrmse_df):
"""
    This method goes through all columns declared as easy and fills
    them up
Parameters
----------
data : Dataframe
DataFrame containing all columns
easy_columns : list
List of all columns which can undergo the easy imputation
nrmse_df : DataFrame
        Dataframe which contains the performance metrics of
all imputation methods
Returns
-------
data : Dataframe
        Dataframe with imputed columns
"""
print("Filling easy columns")
for col in tqdm(easy_columns):
time_series = data.loc[:, col]
best_method = nrmse_df.loc[col, :].sort_values().index[0]
ts_filled = self.fill_by_method(time_series, best_method)
data.loc[:, col] = ts_filled
assert sum(data.loc[:, col].isna()) == 0, \
"Easy imputation went wrong"
return data
The potential flaws of the knn approach
Unfortunately, the superior performance of the knn approach comes at a price. For some features, we do not have only one observation missing at a time, but multiple consecutively missing observations.
If, for example, 12 consecutive observations are missing, the knn method cannot calculate an average out of the preceding and succeeding four observations, given that most of them are missing as well.
The image below, which was created with the beautiful missingno package, shows that all four columns which were classified as being above our one percent threshold have, at some point, 15 consecutive missing observations. This makes it impossible to use the knn method for these columns and is the reason why we cannot use this imputation method for the heavily sparse columns.
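For completeness, a plot like the one above can be produced with missingno along the following lines (a minimal sketch; data and diff_columns refer to the objects defined earlier):

import missingno as msno
import matplotlib.pyplot as plt

# White gaps in the matrix correspond to missing observations;
# long vertical gaps reveal runs of consecutive missing values
msno.matrix(data.loc[:, diff_columns])
plt.show()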
Model-based imputation methods
The model-based imputation methods use, as described earlier, the column with the missing observations as the target and all other available columns as the features. After imputing all columns with fewer than one percent missing observations, we can now use all of them as features.
The model we are using is a RandomForestRegressor because of its good handling of noisy data. The code snippet below shows which hyperparameters were grid-searched.
imputation_model = {
"model": RandomForestRegressor(random_state=28),
"param": {
"n_estimators": [100],
"max_depth": [int(x) for x in
np.linspace(10, 110, num=10)],
"min_samples_split": [2, 5, 10, 15, 100],
"min_samples_leaf": [1, 2, 5, 10]
}
}
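The model_train helper used further below comes from our repository and is not shown in this post. As a rough sketch, such a helper could look like the following with scikit-learn's GridSearchCV (the cross-validation settings here are assumptions, not necessarily the exact ones we used):

from sklearn.model_selection import GridSearchCV

def model_train(model_dict, features, target, scoring, verbose):
    """Hypothetical sketch: grid-search the parameter grid and return
    the refitted best estimator together with its cross-validation score."""
    grid = GridSearchCV(model_dict["model"], model_dict["param"],
                        scoring=scoring, cv=5, n_jobs=-1,
                        verbose=int(verbose))
    grid.fit(features, target)
    return {"model": grid.best_estimator_, "scores": [grid.best_score_]}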
We now run all four columns through the model-based approach and compare their performance to all aforementioned basic imputation methods. The following code snippet takes care of exactly that.
def fill_diff_columns(self, data, model, diff_columns,
easy_columns, nrmse_df, city_name):
"""
This method imputes the difficult columns. Difficult means that
these columns miss more than the specified threshold percentage
of observations. Because of that a model based approach is tried.
If this approach proves better than the normal methods, it is
applied.
Furthermore, we plot the nrmse of the model based approach in order
to compare these with the normal methods
Parameters
----------
data : DataFrame
Dataframe containing all data
model : dictionary
Here we specify the model and the parameters we would like to try
diff_columns : list
List of columns we would like to try
easy_columns : list
List of columns which have less than the threshold percentage
data missing
nrmse_df : Dataframe
Dataframe with the nrmse for all methods and columns
city_name : str
String specifying which city we are talking about
Returns
-------
data : Dataframe
        Dataframe with imputed columns
diff_nrmse_df : Dataframe
Dataframe showing the nrmse performance of the difficult
columns and all methods
"""
non_knn_method = list(set(nrmse_df.columns) - set(["knn"]))
diff_nrmse_df = nrmse_df.loc[diff_columns, non_knn_method]
print("Filling difficult columns")
for col in tqdm(diff_columns):
# Getting data ready
time_series = data.loc[:, col]
non_nan_data = data.dropna(subset=[col])
features = non_nan_data.loc[:, easy_columns]
scaler = StandardScaler().fit(features)
scaled_features = scaler.transform(features)
target = non_nan_data.loc[:, col]
# Model building and evaluation
model_file_name = "{}/{}_{}_model.pickle".format(data_path,
city_name,
col)
if not os.path.isfile(model_file_name):
model_info = model_train(model, scaled_features,
target,
"neg_mean_squared_error",
False)
with open(model_file_name, "wb") as file:
pickle.dump(model_info, file)
else:
with open(model_file_name, "rb") as f:
model_info = pickle.load(f)
target_min, target_max = np.min(target), np.max(target)
rmse = np.sqrt(abs(model_info["scores"][0]))
nrmse_value = rmse / (target_max-target_min)
diff_nrmse_df.loc[col, "model"] = nrmse_value
# Imputing the difficult ones
argmin_method = np.argmin(diff_nrmse_df.loc[col, :])
best_method = diff_nrmse_df.columns[argmin_method]
bool_target_nan = time_series.isna()
if best_method == "model":
features = data.loc[bool_target_nan, easy_columns]
scaled_features = scaler.transform(features)
pred = model_info["model"].predict(scaled_features)
data.loc[bool_target_nan, col] = pred
else:
pred = self.fill_by_method(time_series, best_method)
data.loc[bool_target_nan, col] = pred
assert data.loc[:, list(easy_columns) + diff_columns]\
.isna().any().any() == False, "Still missing data"
# Plotting results
adj_df = diff_nrmse_df.reset_index()
long_format = pd.melt(adj_df, id_vars=["index"], var_name=["nrmse"])
fig, axs = plt.subplots(figsize=(20, 10))
sns.barplot(x="index", y="value", hue="nrmse",
data=long_format, ax=axs)
axs.tick_params(axis="both", labelsize=16, labelrotation=90)
    axs.set_ylabel("Normalized Root Mean Squared Error", fontsize=18)
axs.set_xlabel("Variables", fontsize=18)
axs.legend(prop={"size": 20})
fig.savefig(r"{}/{}_diff_columns.png".format(output_path, city_name),
bbox_inches="tight")
return data, diff_nrmse_df
Below we can see that our work was worthwhile. For three out of the four columns, the model-based approach outperforms the basic imputation methods. We are now left with a fully imputed dataset with which we can proceed.
Stationarity Problems - Seasonality and Trend
In contrast to cross-sectional data, time series data comes with a whole set of additional problems. Undoubtedly one of the biggest issues is stationarity. Stationarity describes a measure of regularity, and it is exactly this regularity which we exploit when building meaningful and powerful forecasting models. The absence of regularity makes it difficult at best to construct a useful model.
There are two types of stationarity, namely strict and covariance stationarity. In order for a time series to fulfil strict stationarity, the series needs to be time independent. That would imply that the relationship between two observations of the series is driven only by the time gap between them, not by the point in time itself. This assumption is difficult, if not impossible, for most time series to meet, and therefore more focus is placed on covariance stationarity.
For a time series to be covariance stationary, it is required that the unconditional first two moments, i.e. the mean and the variance, are finite and do not change over time. It is important to note that the time series is very much allowed to have a varying conditional mean. Additionally, it is required that the auto-covariance of the time series depends only on the lag, not on time itself. All these requirements are also stated below.
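For all t and all lags h:

$$ \mathbb{E}[y_t] = \mu < \infty, \qquad \mathrm{Var}(y_t) = \sigma^2 < \infty, \qquad \mathrm{Cov}(y_t, y_{t-h}) = \gamma(h) $$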
There are many potential reasons for a time series to be non-stationary, including seasonality, unit roots, deterministic trends and structural breaks. In the following sections we check and adjust our exogenous variables for each of these issues to ensure stationarity and better forecasting behavior.
Seasonality
Seasonality is technically a form of non-stationarity because the mean of the time series depends on the time of year. An example would be the spiking sales of a gift shop around Christmas: here the mean of the time series is explicitly dependent on time.
In order to adjust for seasonality within our exogenous variables, we first have to find out which variables actually exhibit that kind of behavior. This is done by applying a Fourier transform. A Fourier transform disentangles a signal into its different frequencies and assesses the power of each individual frequency. The resulting plot, which shows power as a function of frequency, is called a power spectrum. The frequency with the strongest power is then potentially the driving seasonality in our time series. More information about the Fourier transform and signal processing in general can be found in an earlier blog post of ours here.
The following code allows us to take a look at the power plots of our 20 exogenous variables.
def spike_finder(self, data, cols, city_name):
"""
This method calculates the power-plots for all specified
    variables. Spikes above a certain threshold which also exhibit
    the desired prominence are marked. Afterwards an image of all
    columns is saved.
Parameters
----------
data : DataFrame
Dataframe containing all the columns for which we would
like to calculate the power-plots of
cols : list
Columns which we would like to examine
city_name : str
A string denoting which city we are looking at
Returns
-------
spikes_dict : dict
Dictionary which saves the dominant and prominent
frequencies for each exogenous variables
"""
fig, axs = plt.subplots(nrows=4, ncols=5, figsize=(20, 20),
sharex=True)
plt.subplots_adjust(right=None)
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor="none", top=False, bottom=False,
left=False, right=False)
plt.grid(False)
plt.xlabel("Frequency [1 / Hour]", fontsize=22, labelpad=20)
plt.ylabel("Amplitude", fontsize=22, labelpad=50)
spikes_dict = {}
axs = axs.ravel()
for i, col in enumerate(cols):
signal = data.loc[:, col].copy()
fft_output = fft.fft(signal.values)
power = np.abs(fft_output)
freq = fft.fftfreq(len(signal))
mask = freq > 0
pos_freq = freq[mask]
power = power[mask]
axs[i].plot(pos_freq, power)
axs[i].tick_params(axis="both", labelsize=16)
axs[i].set_title(col, fontsize=12)
relevant_power = power[:int(len(power)/4)]
prominence = np.mean(relevant_power) * 5
threshold = np.mean(relevant_power) + 3 * np.std(relevant_power)
peaks = sig.find_peaks(relevant_power, prominence=prominence,
threshold=threshold)[0]
peak_freq = pos_freq[peaks]
peak_power = power[peaks]
axs[i].plot(peak_freq, peak_power, "ro")
        if len(peak_freq) > 0:
            spikes_dict[col] = (1/peak_freq).tolist()[0]
        else:
            spikes_dict[col] = np.nan
    # Save the figure and return the detected periodicities
    # (file name assumed; it mirrors the naming of the other plots)
    fig.savefig(r"{}/{}_power_plots.png".format(output_path, city_name),
                bbox_inches="tight")
    return spikes_dict
The plot below shows the resulting power spectra of the 20 exogenous variables. Whether a predominant and significant spike is found for a variable is indicated by a red dot on top of that spike. If a red dot is visible, the time series has a significantly dominant frequency and therefore a strong seasonality component.
One possibility to cross-check the results of the Fourier transform is to plot the autocorrelation function. If a series truly has a seasonality with period X, we would expect a significant correlation at lag X. The following snippet of code plots the autocorrelation function for all features and highlights those features which were found to have a seasonal effect according to the Fourier transform.
def acf_plots(self, data, cols, spike_dict, city_name):
"""
This method plots the autocorrelation functions for all
specified columns in a specified dataframe. Furthermore,
the biggest possible spike for each column, if there is any,
is made visible through a vertical line and a legend
Parameters
----------
data : DataFrame
The dataframe which contains all exogenous variables.
cols : list
A list containing the columns which should be
analysed
spike_dict : dict
A dictionary having all columns as the keys and the
potential spike as the value
city_name : str
A string to save the resulting png properly
Returns
-------
None.
"""
fig, axs = plt.subplots(nrows=4, ncols=5, sharex=True, sharey=True,
figsize=(20, 20))
plt.subplots_adjust(right=None)
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor="none", top=False, bottom=False,
left=False, right=False)
plt.grid(False)
plt.xlabel("Lags", fontsize=22, labelpad=20)
plt.ylabel("Correlation", fontsize=22, labelpad=50)
axs = axs.ravel()
max_lags = round(np.nanmax(list(spike_dict.values())))
for i, col in enumerate(cols):
series = data.loc[:, col].copy()
sm.graphics.tsa.plot_acf(series.values.squeeze(),
lags=max_lags, ax=axs[i], missing="drop")
axs[i].set_title(col, fontsize=12)
axs[i].tick_params(axis="both", labelsize=16)
if not np.isnan(spike_dict[col]):
axs[i].axvline(spike_dict[col], -1, 1, color="red",
label="Periodicity: {}".format(spike_dict[col]))
axs[i].legend(loc="upper center", prop={'size': 16})
fig.tight_layout()
fig.savefig(r"{}/{}_autocorrelation_function.png".format(output_path,
city_name),
bbox_inches="tight")
From the ACF plots below, we can extract a lot of useful information. First of all, we can clearly see that for all columns where the Fourier transform finds a significant seasonality, the ACF confirms the picture: we see a peaking and significant autocorrelation at exactly the lag found by the power plot.
Additionally, we find some variables (e.g. ndvi_nw) which exhibit a constantly significant positive autocorrelation. This is a sign of non-stationarity, which will be addressed in the next sections dealing with deterministic and stochastic trends.
In order to get rid of the seasonal component, we decompose each seasonality-affected feature into a trend component, a seasonal component and a remainder. This is done by the STL decomposition, which was developed by Cleveland, McRae & Terpenning (1990). STL is an acronym for "Seasonal and Trend decomposition using Loess", where Loess is a method for estimating non-linear relationships.
The following code snippet decomposes the relevant time series and subtracts the seasonal component from them (given that we face additive seasonalities); the trend is dealt with separately in the next section.
def season_trend(self, data, cols, spike_dict):
"""
    This method deseasonalizes the time series by removing
    (subtracting) the modelled seasonal component.
Parameters
----------
data : DataFrame
A dataframe containing the relevant time series
cols : list
A list which specifies all potentially affected columns
spike_dict : dict
A dictionary stating the significant seasonality for
each column
Returns
-------
data : Dataframe
After decomposing and 'cleaning', we put the variables
back into the dataframe which is returned
"""
for col in cols:
period = spike_dict[col]
time_series = data.loc[:, col]
if not np.isnan(period):
res = STL(time_series, period=int(spike_dict[col])+1).fit()
adjusted_series = time_series - res.seasonal
data.loc[:, col] = adjusted_series
return data
Deterministic Trends
One more obvious way to breach the assumptions of covariance stationarity is if the series has a deterministic trend. It is important to stress the difference between a deterministic and a stochastic trend (unit root). Whereas it is possible to model and remove a deterministic trend, this is not possible for a stochastic trend, given its unpredictable and random behavior.
A deterministic trend is the simplest form of a non-stationary process, and a time series which exhibits such a trend can be decomposed into three components:
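$$ y_t = a_0 + b\,t + \varepsilon_t $$

Here a_0 is a constant level, b t a deterministic (in the simplest case linear) time trend and ε_t a stationary error term.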
The most common type of trend is a linear trend. It is relatively straightforward to test for such a trend and to remove it if one is found. We apply the original Mann-Kendall test, which does not consider seasonal effects, since we already removed those in the part above. If a trend is found, it is simply subtracted from the time series. These steps are completed in the method shown below.
def trend_detecter(self, data, cols):
"""
This method tests for a deterministic trend using the
Mann-Kendall test. If the test is found to be significant,
the trend is removed (subtracted).
Parameters
----------
data : DataFrame
A dataframe containing all the relevant columns
cols : list
A list of column names for which we apply the test
Returns
-------
no_nan_data : DataFrame
A dataframe with the potentially removed trend series
trend_dict : dict
A dictionary containing the information of the detrending
"""
trend_dict = {}
for col in cols:
trend_dict[col] = {}
time_series = data.loc[:, col]
result = mk.original_test(time_series)
trend_dict[col]["pre_trend"] = result.trend
if result.trend != "no trend":
d_trend = [(result.intercept + result.slope * x)
for x in np.arange(len(time_series))]
trend_dict[col]["intercept"] = result.intercept
trend_dict[col]["slope"] = result.slope
adj_time_series = time_series - d_trend
result = mk.original_test(adj_time_series)
trend_dict[col]["post_trend"] = result.trend
data.loc[:, col] = adj_time_series
no_nan_data = data.dropna(subset=cols).reset_index(drop=True)
return no_nan_data, trend_dict
The results can be viewed in the table below. As we can see, most time series exhibited a linear trend, which was then removed.
| Feature | Trend | Slope | Intercept | Post Trend |
|---|---|---|---|---|
| ndvi_ne | decreasing | -8.804180148944889e-05 | 0.10424117470021918 | no trend |
| ndvi_nw | decreasing | -0.00010508442895612228 | 0.12204044630128306 | no trend |
| ndvi_se | no trend | NA | NA | NA |
| ndvi_sw | decreasing | -1.9969121004566154e-05 | 0.17419584980022826 | no trend |
| precipitation_amt_mm | decreasing | -0.003975390783667321 | 31.7450108145569 | no trend |
| reanalysis_air_temp_k | increasing | 0.00079050539377047 | 298.76309596706034 | no trend |
| reanalysis_avg_temp_k | increasing | 0.0007647709685904174 | 298.8943108197956 | no trend |
| reanalysis_dew_point_temp_k | increasing | 0.0004375871000444891 | 294.9033123255635 | no trend |
| reanalysis_max_air_temp_k | increasing | 0.000685165752960625 | 301.01619501182535 | no trend |
| reanalysis_min_air_temp_k | increasing | 0.0006468026767682056 | 297.00822681183297 | no trend |
| reanalysis_precip_amt_kg_per_m2 | decreasing | -0.012109224127889808 | 32.6573650987486 | no trend |
| reanalysis_relative_humidity_percent | decreasing | -0.0017669399624736161 | 79.43547713504724 | no trend |
| reanalysis_sat_precip_amt_mm | decreasing | -0.003975390783667321 | 31.7450108145569 | no trend |
| reanalysis_specific_humidity_g_per_kg | increasing | 0.0004258943155599426 | 16.345457478691138 | no trend |
| reanalysis_tdtr_k | increasing | 0.00032172583198841135 | 2.292595026925666 | no trend |
| station_avg_temp_c | increasing | 0.0001547004468416762 | 26.96210500459904 | no trend |
| station_diur_temp_rng_c | decreasing | -0.0009195402298850562 | 7.149425287356321 | no trend |
| station_max_temp_c | decreasing | -0.00030887303593728855 | 31.81129219530351 | no trend |
| station_min_temp_c | increasing | 0.0006413032891119354 | 22.338233161530237 | no trend |
| station_precip_mm | increasing | 0.00986137183746052 | 19.02484232581643 | no trend |
Even though we removed the deterministic trends, this does not ensure that our time series are actually stationary now. What works for a deterministic trend does not work for a stochastic trend, meaning that the detrending we just performed does not remove unit roots.
We therefore have to explicitly test for a unit root in every time series.
Stochastic Trends - Unit roots
A unit root process is a generalization of the classic random walk, which is defined as a succession of random steps. Given this definition, the problems of forecasting such a time series are obvious. Furthermore, a unit root process violates the covariance stationarity assumption of not being dependent on time.
To see why that is the case, we assume an autoregressive model in which today's value depends only on yesterday's value and an error term:
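$$ y_t = a_1\, y_{t-1} + \varepsilon_t $$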
If the parameter a_1 were now equal to one, the process would simplify to a random walk:
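$$ y_t = y_{t-1} + \varepsilon_t $$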
By repeated substitution we could also write this expression as:
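$$ y_t = y_0 + \sum_{i=1}^{t} \varepsilon_i $$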
When we now calculate the variance of y_t, we face a variance which is positively and linearly dependent on time, which violates the second covariance stationarity requirement.
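Assuming i.i.d. errors with variance σ², this gives

$$ \mathrm{Var}(y_t) = \mathrm{Var}\!\left(y_0 + \sum_{i=1}^{t} \varepsilon_i\right) = t\,\sigma^2, $$

which grows linearly with t.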
This would not have been the case if a_1 were smaller than one. That is essentially what is tested in a unit-root test. Arguably the most well-known test for a unit root is the Augmented Dickey-Fuller (ADF) test. This test has the null hypothesis that a unit root is present in an autoregressive model; the alternative is normally that the series is stationary or trend-stationary. Given that we already removed a (linear) trend, we assume that the alternative is a stationary series.
In order to be technically correct, it should be said that the ADF test does not directly test whether a_1 is equal to one, but rather looks at the characteristic equation. The equation below illustrates what is meant by that:
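$$ \Delta y_t = y_t - y_{t-1} = (a_1 - 1)\, y_{t-1} + \varepsilon_t = \delta\, y_{t-1} + \varepsilon_t $$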
We can see that the difference to the equation before is that we no longer look at the level of y_t, but rather at its first difference. Capital delta represents the difference operator, and the ADF test checks whether the coefficient δ is equal to zero (the "augmented" version adds lagged differences of y_t as additional regressors, but the intuition is the same). If δ were not zero but positive, the change from today to tomorrow would depend on today's level: a high value today implies a large change tomorrow, a self-reinforcing and explosive process which clearly depends on time and therefore breaks the assumptions of covariance stationarity.
If the test cannot reject the null hypothesis of a unit root (meaning a p-value above 5%), we difference the time series as often as necessary until we obtain a stationary series. All of that is done through the following two methods.
def dickey_fuller_test(self, data, cols):
"""
    Method to test whether a unit root is present in certain
    columns of a dataframe, using the ADF test
Parameters
----------
data : Dataframe
A dataframe which contains all series we would like to
test
cols : list
A list containing all columns names for which we would
like to conduct the test for.
Returns
-------
adf_dict : dict
Dictionary containing the test result for every series.
"""
adf_dict = {}
for col in cols:
time_series = data.loc[:, col].dropna()
result = adfuller(time_series, autolag="AIC", regression="c")
adf_dict[col] = result[1]
return adf_dict
def diff_sign_columns(self, data, cols, adf_dict):
"""
    This method differences a time series whenever the Dickey-Fuller
    test is not significant, i.e. when the null hypothesis of a unit
    root cannot be rejected. This is repeated until the ADF test
    becomes significant.
Parameters
----------
data : Dataframe
A dataframe containing all the time series we would like to test
cols : list
List of column names we would like to test
adf_dict : dict
dictionary containing the test results of the dickey fuller test
Returns
-------
data : DataFrame
A dataframe with the now potentially differenced series
adf_dict : dict
A dictionary with the now significant dickey fuller tests
number_of_diff : dict
A dictionary telling how often each series was differenced.
"""
number_of_diff = {}
for col in cols:
pvalue = adf_dict[col]
time_series = data.loc[:, col].dropna()
while pvalue > 0.05:
time_series = time_series.diff(periods=1)
pvalue = adfuller(time_series.dropna(),
autolag="AIC",
regression="c")[1]
number_of_diff[col] = sum(time_series.isna())
adf_dict[col] = pvalue
data.loc[:, col] = time_series
return data, adf_dict, number_of_diff
The following table shows that every ADF test is significant, meaning that no differencing was needed and that no series exhibits a unit root.
Finishing up
Last but not least, we take a look at our processed time series. It is nicely visible that none of the time series are trending anymore and that they no longer exhibit significant seasonality.
Additionally, we take a look at the distributions of all of the series. It is important to note that there are no distributional assumptions on the feature variables when it comes to forecasting. That means that even if we find highly skewed variables, it is not necessary to apply any transformation.
After sufficiently transforming all exogenous variables, it is now time to shift our attention to the forecasting procedure for both cities.