[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders and Detailed Explanation of Python Code Question 2

Posted on 2023-05-07 20:04


1 Problem statement

For the complete problem statement, please refer to the article on Question 1.
Based on the analysis of Question 1, establish a mathematical model to predict the monthly demand over the next three months (i.e., January, February, and March 2019) for the products given in the attached forecast data (predict_sku1.csv). Save the forecast results in the format of Table 3 as the file result1.xlsx, and submit it together with the paper. Make predictions at daily, weekly, and monthly time granularity respectively, and analyze the impact of the prediction granularity on the prediction accuracy.
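The three granularities can be illustrated with a minimal sketch on synthetic data. The column names order_date and ord_qty follow the code later in this post; the values are randomly generated for illustration and are not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic daily orders for a single product (illustration only)
rng = np.random.default_rng(0)
dates = pd.date_range("2018-10-01", "2018-12-31", freq="D")
orders = pd.DataFrame({"order_date": dates,
                       "ord_qty": rng.integers(0, 20, len(dates))})

# Daily granularity: one value per calendar day
daily = orders.set_index("order_date")["ord_qty"]

# Weekly granularity: pandas' default "W" bins end on Sunday
weekly = daily.resample("W").sum()

# Monthly granularity: group by calendar month
monthly = daily.groupby(daily.index.to_period("M")).sum()

print(len(daily), len(weekly), len(monthly))
```

Daily values roll up to months exactly, whereas a week can straddle a month boundary, so weekly forecasts need an allocation rule before they can be reported as monthly demand.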

2 Analysis of Question 2

2.1 Problem Analysis

This is a time series forecasting problem. Commonly used time series forecasting models include:

Autoregressive Moving Average Model (ARMA)
Autoregressive Integrated Moving Average Model (ARIMA)
Seasonal Autoregressive Integrated Moving Average Model (SARIMA)
Autoregressive Conditional Heteroskedasticity Model (ARCH)
Long Short-Term Memory Network (LSTM)

This task is a multi-input time series forecasting problem. Time series models that accept multiple inputs mainly include the following:

ARIMAX model: Extends the ARIMA model with exogenous variables as additional inputs, so that the impact of external factors on the time series can be taken into account.
VAR model: The Vector Autoregression (VAR) model is a multivariate time series model that captures the mutual influence between several time series.
LSTM model: Long Short-Term Memory (LSTM) is a recurrent neural network that can model long-term dependencies in a time series and supports multiple input features.
Prophet model: A forecasting model developed by Facebook. It uses an additive model, can include multiple exogenous variables, and is comparatively easy to interpret.
SARIMAX model: Seasonal ARIMA with exogenous regressors (SARIMAX) extends the ARIMAX model. It accounts for seasonal variation and also supports multiple exogenous variables.

By introducing external variables, these models can improve the accuracy and interpretability of time series forecasts, but overfitting and variable selection need attention. In practice, the model should be chosen according to the characteristics of the data and the prediction target.

2.2 Modeling steps of time series forecasting problem

Data cleaning and processing: clean the historical data, including removing outliers and handling missing values. The data must also be sorted in chronological order.
Time series decomposition: decompose the series into trend, seasonal, and random (residual) components by fitting an additive or a multiplicative model. The additive model assumes the observed data equal the sum of the trend, seasonal, and residual components, while the multiplicative model assumes they equal the product of these components.
Model selection and fitting: choose an appropriate time series model to fit the trend, seasonal, and random components. Commonly used models include ARIMA and exponential smoothing.
Model diagnosis: check the fitted model, for example whether the residuals are approximately normally distributed and whether they still show autocorrelation.
Model prediction: use the fitted model to predict future demand and calculate the prediction accuracy.

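In this task, the training and prediction data are read first, and the date column of the training set is converted to a datetime type and set as the index. The data are then grouped along several dimensions, and each group's time series is tested for stationarity; if a series is not stationary, first-order or higher-order differencing is applied until it is. Next, a SARIMA model is fitted to each group's differenced series and used to predict demand for the next three months. During prediction, exogenous variables are generated for each prediction sample from its sales region, product, first-level category, and second-level category, and serve as external inputs to the model. Finally, the prediction results are saved to an Excel file.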

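2.3 Directions for improvement

There are many ways to improve a time series forecasting model. Several common ones are listed below:

Tuning model parameters: model performance can be improved by adjusting the model's parameters. For example, the p, d, and q parameters can be tuned for an ARIMA model, and the number of neurons, the learning rate, and the number of training iterations for an LSTM model. Parameter tuning should be evaluated with cross-validation or a similar method to assess performance and generalization.
Adding features: prediction accuracy can be improved by adding more features. Besides historical and calendar features, other related features that may influence the series, such as weather or economic data, can be considered.
Data augmentation: prediction accuracy can be improved by adding more historical data, for example by extending the historical range or increasing the data resolution.
Model fusion: the predictions of several models can be combined by weighted averaging or stacking. Fusion improves the overall result because the strengths of the individual models complement each other.
Ensemble learning: ensemble learning combines several base models to improve the overall prediction. For example, Bagging or Boosting can be used to combine base models such as decision trees or LSTMs.
Adjusting the training data: the training data can be smoothed or processed with sliding windows, for example with a moving average or exponential smoothing.

Note that any model improvement should be evaluated with cross-validation or a similar method to assess performance and generalization, so as to avoid overfitting or underfitting.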
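Model fusion with a held-out evaluation can be sketched as below. The two base forecasters (a 3-month moving average and simple exponential smoothing) and the inverse-error weighting are illustrative choices, not the method used in this post, and the demand series is synthetic.

```python
import numpy as np
import pandas as pd

def mape(actual, predicted):
    """Mean absolute percentage error; assumes actual values are non-zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))

def base_forecasts(history, steps):
    """Two deliberately simple forecasters that repeat one level for every step."""
    moving_avg = history.tail(3).mean()
    exp_smooth = history.ewm(alpha=0.5, adjust=False).mean().iloc[-1]
    return {"moving_avg": np.repeat(moving_avg, steps),
            "exp_smooth": np.repeat(exp_smooth, steps)}

rng = np.random.default_rng(2)
idx = pd.date_range("2017-01-01", periods=24, freq="MS")
demand = pd.Series(100 + rng.normal(0, 5, 24), index=idx)

# Chronological split: fit / validation / test (last 3 months)
fit_part, valid_part, test_part = demand.iloc[:-6], demand.iloc[-6:-3], demand.iloc[-3:]

# Weights come from validation error only, so the test months stay untouched
valid_errors = {name: mape(valid_part, f)
                for name, f in base_forecasts(fit_part, 3).items()}
inverse = {name: 1.0 / (err + 1e-9) for name, err in valid_errors.items()}
weights = {name: v / sum(inverse.values()) for name, v in inverse.items()}

# Refit on everything before the test window and blend
final = base_forecasts(demand.iloc[:-3], 3)
blend = sum(weights[name] * final[name] for name in final)
print(weights, mape(test_part, blend))
```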

3 Python implementation

Since the complete data set is not available at present, running the following code raises ValueError: sample size is too short to use selected regression component. The error occurs because some region and product groups contain too few samples to form a usable time series, so the stationarity test and differencing cannot be applied to them.
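One possible way to avoid this error is to separate groups with enough history from those without before any test or model is applied. The sketch below is an assumption-laden illustration: MIN_OBS is a hypothetical threshold, the order data are synthetic, and the mean-based fallback is only one simple option for short series.

```python
import pandas as pd

MIN_OBS = 24  # hypothetical threshold: two full years of monthly points

def split_by_history(df, keys, min_obs=MIN_OBS):
    """Return (enough, too_short): dicts mapping group key to its time series."""
    enough, too_short = {}, {}
    for key, grp in df.groupby(keys):
        series = grp.set_index("order_date")["ord_qty"].sort_index()
        target = enough if len(series) >= min_obs else too_short
        target[key] = series
    return enough, too_short

# Synthetic monthly orders: one long history and one with only two points
long_idx = pd.date_range("2016-07-01", periods=30, freq="MS")
short_idx = pd.date_range("2018-11-01", periods=2, freq="MS")
orders = pd.DataFrame({
    "sales_region_code": [101] * 30 + [102] * 2,
    "item_code": [20001] * 30 + [20002] * 2,
    "order_date": list(long_idx) + list(short_idx),
    "ord_qty": list(range(50, 80)) + [5, 7],
})

enough, too_short = split_by_history(orders, ["sales_region_code", "item_code"])

# Fallback for short series: use the historical mean as the monthly forecast
fallback = {key: float(s.mean()) for key, s in too_short.items()}
print(list(enough), fallback)
```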

3.1 Taking months as the time granularity

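The code below is commented to explain the approach:
import pandas as pd
import statsmodels.api as sm
from datetime import datetime, timedelta

# Read the training orders and the list of products to forecast
train_data = pd.read_csv('data/order_train0.csv')
predict_data = pd.read_csv('data/predict_sku0.csv')

# Parse the order date and use it as the time index
train_data['order_date'] = pd.to_datetime(train_data['order_date'])
train_data = train_data.set_index('order_date')

... (omitted; please download the complete code)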

def make_stationary(ts):
    # First-order differencing
    ts_diff = ts.diff().dropna()
    # Keep differencing until the ADF test rejects a unit root (p-value < 0.05)
    while sm.tsa.stattools.adfuller(ts_diff)[1] >= 0.05:
        ts_diff = ts_diff.diff().dropna()
    return ts_diff

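# Difference each (region, product, first-level category, second-level category) series separately
train_ts_diff = train_ts.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']).apply(make_stationary)

# Non-seasonal (p, d, q) and seasonal (P, D, Q, s) orders; s=12 is a yearly cycle on monthly data
# Note: d=1 applies one more difference on top of the differencing done by make_stationary
order = (1, 1, 1)
seasonal_order = (1, 0, 1, 12)

# Fit the SARIMAX model on the differenced series
model = sm.tsa.statespace.SARIMAX(train_ts_diff, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
result = model.fit()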

# Forecast horizon: the next three months (freq='M' yields month-end dates)

start_date = datetime(2019, 1, 1)
end_date = datetime(2019, 3, 31)
predict_dates = pd.date_range(start=start_date, end=end_date, freq='M')

# Predict demand for each sales region, product, first-level category, and second-level category

predict = pd.DataFrame()
for i in range(len(predict_data)):
    # Build the exogenous variables
    predict_exog = pd.DataFrame(predict_data.iloc[i, :]).T.set_index(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])
    predict_exog.index = pd.MultiIndex.from_tuples(predict_exog.index)
    predict_exog = predict_exog.reindex(index=train_ts_diff.index.union(predict_exog.index), fill_value=0).sort_index()
    predict_exog = predict_exog.loc[predict_dates]
    # Predict demand for the next three months
    predict_diff = result.get_forecast(steps=len(predict_dates), exog=predict_exog, dynamic=True)

    # Add the last month's difference value from the training set to the predicted differences
    predict_diff_predicted = predict_diff.predicted_mean
    predict_diff_predicted = predict_diff_predicted + train_ts_diff.iloc[-1]

    # Convert the differences back to predicted values
    predict_predicted = predict_diff_predicted.cumsum() + train_ts.iloc[-1]

    # Store the prediction in a DataFrame
    # (the Chinese column headers mean "forecast demand for January/February/March 2019")
    predict_temp = pd.DataFrame({'sales_region_code': [predict_data.iloc[i, 0]], 'item_code': [predict_data.iloc[i, 1]],
                                 '2019年1月预测需求量': predict_predicted.loc['2019-01-01':'2019-01-31'].sum(),
                                 '2019年2月预测需求量': predict_predicted.loc['2019-02-01':'2019-02-28'].sum(),
                                 '2019年3月预测需求量': predict_predicted.loc['2019-03-01':'2019-03-31'].sum()})
    predict = pd.concat([predict, predict_temp], ignore_index=True)

# Save the predictions to an Excel file
predict.to_excel('result1.xlsx', index=False)

3.2 Taking days as the time granularity

import pandas as pd
import statsmodels.api as sm
from datetime import datetime, timedelta

train_data = pd.read_csv('data/order_train0.csv')
predict_data = pd.read_csv('data/predict_sku0.csv')
train_data['order_date'] = pd.to_datetime(train_data['order_date'])
train_data = train_data.set_index('order_date')
train_ts = train_data.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])['ord_qty'].resample('D').sum()


def make_stationary(ts):
    # First-order differencing
    ts_diff = ts.diff().dropna()
    # Keep differencing until the ADF test rejects a unit root (p-value < 0.05)
    while sm.tsa.stattools.adfuller(ts_diff)[1] >= 0.05:
        ts_diff = ts_diff.diff().dropna()
    return ts_diff

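... (omitted; please download the complete code)
# Non-seasonal (p, d, q) and seasonal (P, D, Q, s) orders
# Note: s=12 is carried over from the monthly version; daily data may instead show a weekly cycle (s=7)
order = (1, 1, 1)
seasonal_order = (1, 0, 1, 12)


# Fit the SARIMAX model on the differenced series
model = sm.tsa.statespace.SARIMAX(train_ts_diff, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
result = model.fit()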

# Forecast horizon: every day of the next three months
start_date = datetime(2019, 1, 1)
end_date = datetime(2019, 3, 31)
predict_dates = pd.date_range(start=start_date, end=end_date, freq='D')

# Predict demand for each sales region, product, first-level category, and second-level category
predict = pd.DataFrame()
for i in range(len(predict_data)):
    # Build the exogenous variables
    predict_exog = pd.DataFrame(predict_data.iloc[i, :]).T.set_index(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])
    predict_exog.index = pd.MultiIndex.from_tuples(predict_exog.index)
    predict_exog = predict_exog.reindex(index=train_ts_diff.index.union(predict_exog.index), fill_value=0).sort_index()
    predict_exog = predict_exog.loc[predict_dates]

    # Predict demand for the next three months
    predict_diff = result.get_forecast(steps=len(predict_dates), exog=predict_exog, dynamic=True)

    # Add the last day's difference value from the training set to the predicted differences
    predict_diff_predicted = predict_diff.predicted_mean
    predict_diff_predicted = predict_diff_predicted + train_ts_diff.iloc[-1]

    # Convert the differences back to predicted values
    predict_predicted = predict_diff_predicted.cumsum() + train_ts.iloc[-1]

    # Store the prediction in a DataFrame, summing the daily values within each month
    # (the Chinese column headers mean "forecast demand for January/February/March 2019")
    predict_temp = pd.DataFrame({'sales_region_code': [predict_data.iloc[i, 0]], 'item_code': [predict_data.iloc[i, 1]],
                                 'first_cate_code': [predict_data.iloc[i, 2]], 'second_cate_code': [predict_data.iloc[i, 3]],
                                 '2019年1月预测需求量': predict_predicted.loc['2019-01-01':'2019-01-31'].sum(),
                                 '2019年2月预测需求量': predict_predicted.loc['2019-02-01':'2019-02-28'].sum(),
                                 '2019年3月预测需求量': predict_predicted.loc['2019-03-01':'2019-03-31'].sum()})
    predict = pd.concat([predict, predict_temp], ignore_index=True)

# Save the predictions to an Excel file
predict.to_excel('result1.xlsx', index=False)