Posted on 2023-05-07 20:04
[The Eleventh Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders, with Detailed Python Code: Question 2
Related article: [The Eleventh Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders, with Detailed Python Code: Question 1
For the complete topic, please refer to the article in Question 1.
Based on the analysis in Question 1, establish a mathematical model to predict the monthly demand over the next three months (i.e., January, February, and March 2019) for the products given in the attached forecast data (predict_sku1.csv). Save the forecast results in the format of Table 3 as the file result1.xlsx and submit it together with the paper. Make predictions at daily, weekly, and monthly time granularity, and analyze how the prediction granularity affects the prediction accuracy.
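To make the three granularities concrete, here is a minimal sketch that aggregates one daily series to weekly and monthly totals with pandas. The series is synthetic (Poisson noise), not the competition data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily demand for 2018 (illustrative only, not the competition data)
rng = np.random.default_rng(0)
days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
daily = pd.Series(rng.poisson(lam=20, size=len(days)), index=days, name="ord_qty")

# Aggregate the same series at the other two granularities named in the task
weekly = daily.resample("W").sum()    # weeks ending on Sunday
monthly = daily.resample("MS").sum()  # months labelled by their first day

# Aggregation keeps the total demand but leaves far fewer observations to fit
print(len(daily), len(weekly), len(monthly))
print(daily.sum() == weekly.sum() == monthly.sum())
```

Coarser granularity smooths out daily noise but gives the model fewer points to learn from, which is the trade-off the task asks to analyze.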
This is a time series forecasting problem. More specifically, it is a multi-input forecasting problem: besides the demand history itself, the model can take external variables as inputs, so time series models that accept multiple inputs are the natural candidates.
By introducing external variables, these models can improve the accuracy and interpretability of time series forecasts, but overfitting and variable selection need attention. In practice, the model should be chosen according to the characteristics of the data and the prediction target.
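In this task, first read the training and prediction data sets, convert the date column of the training set to a datetime type, and set it as the index. Next, group the data along the chosen dimensions and test each group's time series for stationarity; if a series is not stationary, apply first-order or higher-order differencing until it is. Then fit a SARIMA model to each group's differenced series and forecast the demand for the next three months. During forecasting, external variables are generated for each prediction sample from its sales region, product, first-level category, and second-level category, and are used as exogenous inputs to the model. Finally, the forecasts are saved to an Excel file.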
Since the complete data set is not available at present, some region/product combinations contain too few samples to form a time series, so they cannot be differenced. For such groups the stationarity test raises ValueError: sample size is too short to use selected regression component.
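One possible way to spot such groups before fitting is to count the observations per group and set aside those below a minimum length. The sketch below does this on a small made-up table; the sample rows and the MIN_OBS threshold are assumptions for illustration, while the column names follow the code in this article.

```python
import pandas as pd

# Made-up orders: one group with five months of data, one with a single month
orders = pd.DataFrame({
    "sales_region_code": [101, 101, 101, 101, 101, 102],
    "item_code": [20001, 20001, 20001, 20001, 20001, 20002],
    "order_date": pd.to_datetime([
        "2018-01-05", "2018-02-11", "2018-03-02",
        "2018-04-20", "2018-05-15", "2018-05-03",
    ]),
    "ord_qty": [5, 3, 8, 2, 7, 4],
})

MIN_OBS = 4  # hypothetical minimum number of monthly observations per group
keys = ["sales_region_code", "item_code"]

# Monthly demand per group, then the number of months each group covers
monthly = orders.set_index("order_date").groupby(keys)["ord_qty"].resample("MS").sum()
n_obs = monthly.groupby(level=keys).size()

usable = n_obs[n_obs >= MIN_OBS].index.tolist()
too_short = n_obs[n_obs < MIN_OBS].index.tolist()
print("usable groups:", usable)
print("too short to test or fit:", too_short)
```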
# Monthly granularity
import pandas as pd
import statsmodels.api as sm

# Read the training set and the prediction set
train_data = pd.read_csv('data/order_train0.csv')
predict_data = pd.read_csv('data/predict_sku0.csv')

# Convert the date column to datetime and use it as the index
train_data['order_date'] = pd.to_datetime(train_data['order_date'])
train_data = train_data.set_index('order_date')

group_cols = ['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']

# Monthly demand per group ('MS' labels each month by its first day)
train_ts = train_data.groupby(group_cols)['ord_qty'].resample('MS').sum()

def make_stationary(ts, max_d=2):
    """Difference ts until the ADF test rejects a unit root; return (series, d)."""
    # First-order differencing
    d = 1
    ts_diff = ts.diff().dropna()
    # Difference further until stationary (p < 0.05), at most max_d times in total
    while d < max_d and not sm.tsa.stattools.adfuller(ts_diff)[1] < 0.05:
        ts_diff = ts_diff.diff().dropna()
        d += 1
    return ts_diff, d

seasonal_order = (1, 0, 1, 12)

# Months to forecast: January, February, and March 2019
predict_dates = pd.date_range(start='2019-01-01', periods=3, freq='MS')

# One series per (sales region, product, first-level category, second-level category)
series_by_group = {key: ts.droplevel(group_cols).asfreq('MS', fill_value=0)
                   for key, ts in train_ts.groupby(level=group_cols)}

# Forecast the demand for every row of the prediction set
rows = []
for key in predict_data[group_cols].itertuples(index=False, name=None):
    ts = series_by_group.get(key)
    forecast = pd.Series(float('nan'), index=predict_dates)
    if ts is not None:
        try:
            # SARIMAX differences internally (order d), so the forecasts are
            # already on the original scale and need no cumulative sum
            _, d = make_stationary(ts)
            model = sm.tsa.statespace.SARIMAX(ts, order=(1, d, 1), seasonal_order=seasonal_order,
                                              enforce_stationarity=False, enforce_invertibility=False)
            # The source line is truncated after "result ="; a plain fit is assumed
            result = model.fit(disp=False)
            forecast = result.predict(start=predict_dates[0], end=predict_dates[-1])
        except ValueError:
            # Too few samples for the ADF test or the fit: leave the forecast as NaN
            pass
    # Column names are kept from the original code:
    # forecast demand for January, February, and March 2019
    rows.append({'sales_region_code': key[0], 'item_code': key[1],
                 '2019年1月预测需求量': forecast.loc['2019-01-01':'2019-01-31'].sum(min_count=1),
                 '2019年2月预测需求量': forecast.loc['2019-02-01':'2019-02-28'].sum(min_count=1),
                 '2019年3月预测需求量': forecast.loc['2019-03-01':'2019-03-31'].sum(min_count=1)})
predict = pd.DataFrame(rows)

# Save the forecasts to an Excel file
predict.to_excel('result1.xlsx', index=False)
# Daily granularity
import pandas as pd
import statsmodels.api as sm

# Read the training set and the prediction set
train_data = pd.read_csv('data/order_train0.csv')
predict_data = pd.read_csv('data/predict_sku0.csv')

# Convert the date column to datetime and use it as the index
train_data['order_date'] = pd.to_datetime(train_data['order_date'])
train_data = train_data.set_index('order_date')

group_cols = ['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']

# Daily demand per group
train_ts = train_data.groupby(group_cols)['ord_qty'].resample('D').sum()

def make_stationary(ts, max_d=2):
    """Difference ts until the ADF test rejects a unit root; return (series, d)."""
    # First-order differencing
    d = 1
    ts_diff = ts.diff().dropna()
    # Difference further until stationary (p < 0.05), at most max_d times in total
    while d < max_d and not sm.tsa.stattools.adfuller(ts_diff)[1] < 0.05:
        ts_diff = ts_diff.diff().dropna()
        d += 1
    return ts_diff, d

seasonal_order = (1, 0, 1, 12)

# Days to forecast: 1 January to 31 March 2019
predict_dates = pd.date_range(start='2019-01-01', end='2019-03-31', freq='D')

# One series per (sales region, product, first-level category, second-level category)
series_by_group = {key: ts.droplevel(group_cols).asfreq('D', fill_value=0)
                   for key, ts in train_ts.groupby(level=group_cols)}

# Forecast the demand for every row of the prediction set
rows = []
for key in predict_data[group_cols].itertuples(index=False, name=None):
    ts = series_by_group.get(key)
    forecast = pd.Series(float('nan'), index=predict_dates)
    if ts is not None:
        try:
            # SARIMAX differences internally (order d), so the forecasts are
            # already on the original scale and need no cumulative sum
            _, d = make_stationary(ts)
            model = sm.tsa.statespace.SARIMAX(ts, order=(1, d, 1), seasonal_order=seasonal_order,
                                              enforce_stationarity=False, enforce_invertibility=False)
            # The source line is truncated after "result ="; a plain fit is assumed
            result = model.fit(disp=False)
            forecast = result.predict(start=predict_dates[0], end=predict_dates[-1])
        except ValueError:
            # Too few samples for the ADF test or the fit: leave the forecast as NaN
            pass
    # Sum the daily forecasts into monthly totals. Column names are kept from the
    # original code: forecast demand for January, February, and March 2019
    rows.append({'sales_region_code': key[0], 'item_code': key[1],
                 'first_cate_code': key[2], 'second_cate_code': key[3],
                 '2019年1月预测需求量': forecast.loc['2019-01-01':'2019-01-31'].sum(min_count=1),
                 '2019年2月预测需求量': forecast.loc['2019-02-01':'2019-02-28'].sum(min_count=1),
                 '2019年3月预测需求量': forecast.loc['2019-03-01':'2019-03-31'].sum(min_count=1)})
predict = pd.DataFrame(rows)

# Save the forecasts to an Excel file
predict.to_excel('result1.xlsx', index=False)
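To compare granularities, one option is to hold out the last months of the training period, forecast them at each granularity, and score every forecast on the same monthly totals. The sketch below illustrates the idea with naive mean forecasts on synthetic data. The baselines, the mape helper, and the hold-out split are assumptions for illustration; in practice the forecasts would come from the fitted models.

```python
import numpy as np
import pandas as pd

# Synthetic daily demand for 2018; the last three months are held out
rng = np.random.default_rng(7)
days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
daily = pd.Series(rng.poisson(20, len(days)).astype(float), index=days)
train, test = daily.loc[:"2018-09-30"], daily.loc["2018-10-01":]

# Daily-granularity baseline: repeat the mean of the last 28 training days
daily_fc = pd.Series(train.iloc[-28:].mean(), index=test.index)

# Monthly-granularity baseline: repeat the mean of the last 3 monthly totals
train_m = train.resample("MS").sum()
test_m = test.resample("MS").sum()
monthly_fc = pd.Series(train_m.iloc[-3:].mean(), index=test_m.index)

def mape(actual, predicted):
    """Mean absolute percentage error (assumes no zero actual values)."""
    return float(((actual - predicted).abs() / actual).mean())

# Score both forecasts on the same target: monthly totals
mape_daily = mape(test_m, daily_fc.resample("MS").sum())
mape_monthly = mape(test_m, monthly_fc)
print("daily forecasts, summed per month:", round(mape_daily, 4))
print("monthly forecasts:", round(mape_monthly, 4))
```

Scoring on a common target matters here: errors measured on daily values and on monthly totals are not directly comparable.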
source:python black hole net