
[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders and Detailed Explanation of Python Code Question 1

Posted on 2023-05-21 17:27



Related Links

(1) Modeling scheme

[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders and Detailed Explanation of Python Code Question 1

[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders and Detailed Explanation of Python Code Question 2

(2) Papers on relevant competition topics

[The 11th Teddy Cup Data Mining Challenge in 2023] Question A: Analysis of COVID-19 Epidemic Prevention and Control Data, 32-page and 40-page papers and implementation code

[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting of Product Orders, 23-page paper and implementation code

[The 11th Teddy Cup Data Mining Challenge in 2023] Question C: Construction of a Two-Way Recommendation System for Recruitment and Job Hunting on the Teddy Internal Referral Platform, 27-page paper and implementation code

1 Problem statement

I. Problem background

In recent years, the external environment of enterprises has become increasingly uncertain, and this complex and changeable environment has made supply chains harder to manage.

Demand forecasting is the first line of defense in an enterprise's supply chain. Because demand is affected by many factors, forecast accuracy is generally low, and better algorithms are needed. A demand forecast is a reasoned estimate of future demand based on historical data. First, it gives management a reference for decisions on future sales and operations plans, targets, and capital budgets. Second, it supports procurement and production planning and reduces the impact of business fluctuations. Without a demand forecast, or with an inaccurate one, internal decisions on sales, procurement, and financial budgets can only rely on experience. The result is poor market anticipation, backlogs or shortages of inventory and capital, and higher inventory costs.

II. Data description

The training data (order_train1.csv) in the attachment contains the shipments of a large domestic manufacturing company to its dealers from September 1, 2015 to December 20, 2018 (see Table 1 for the format). It reflects the price and demand of the company's products in different sales regions and includes: order_date (order date), sales_region_code (sales region code), item_code (product code), first_cate_code (major product category code), second_cate_code (product subcategory code), sales_chan_name (sales channel name), item_price (product price), and ord_qty (order quantity).

Table 1: Format of the training data (historical data)

Here, the order date is the date on which the demand occurred; one major product category code corresponds to multiple product subcategory codes; and the sales channel name is either online or offline, where "online" refers to e-commerce platforms such as Taobao and JD.com and "offline" refers to physical dealers.

The forecast data (predict_sku1.csv) in the attachment lists the sales region code, product code, major product category, and product subcategory of each product to be forecast (see Table 2 for the format).

Table 2: Sample data for the products to be forecast

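III. Problems to solve

  1. Conduct an in-depth analysis of the training data (order_train1.csv) in the attachment. You may address, but are not limited to, the following questions:

(1) The impact of a product's price on demand;

(2) The impact of the sales region on demand, and the characteristics of demand in different regions;

(3) The characteristics of demand under different sales channels (online and offline);

(4) The differences and similarities in demand between product categories;

(5) The characteristics of demand in different periods (for example the beginning, middle, and end of the month);

(6) The impact of holidays on demand;

(7) The impact of promotions (such as 618 and Double 11) on demand;

(8) The impact of seasonal factors on demand.

  2. Based on the analysis above, build a mathematical model and, for the products listed in the forecast data (predict_sku1.csv) in the attachment, forecast the monthly demand for the next three months (January, February, and March 2019). Save the forecast in the format of Table 3 as result1.xlsx and submit it together with the paper. Forecast at daily, weekly, and monthly time granularity, and analyze how the granularity affects forecast accuracy.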

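2 Problem analysis

2.1 Question 1

(1) The impact of a product's price on demand

First, read the data and extract the item_price and ord_qty columns. Then group by item_price and compute the average order quantity at each price. Finally, plot the average order quantity against price.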

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Read the data
df = pd.read_csv('data/order_train0.csv')
# Group by product price and compute the average order quantity
grouped = df.groupby('item_price')['ord_qty'].mean().reset_index()

# Plot with Matplotlib
plt.figure(figsize=(10, 6))
plt.plot(grouped['item_price'], grouped['ord_qty'], 'o-')
plt.xlabel('Product Price')
plt.ylabel('Average Order Quantity')
plt.title('Relationship between Product Price and Order Quantity')
plt.savefig('img/1.png', dpi=300)
# Plot with Seaborn
sns.set_style('darkgrid')
plt.figure(figsize=(10, 6))
sns.lineplot(x='item_price', y='ord_qty', data=grouped)
plt.xlabel('Product Price')
plt.ylabel('Average Order Quantity')
plt.title('Relationship between Product Price and Order Quantity')
plt.savefig('img/2.png', dpi=300)
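
Grouping by exact price gives one point per distinct price, which can be noisy. If the goal is the average demand per price interval, one option is to bin the prices first. The following is a minimal sketch on a small synthetic table; the values and the choice of three equal-frequency bins are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Synthetic stand-in for the training data (illustrative values only)
df = pd.DataFrame({
    'item_price': [10, 12, 15, 40, 45, 50, 90, 95, 100],
    'ord_qty':    [30, 28, 32, 10, 12,  8, 25, 27,  26],
})

# Split prices into three equal-frequency intervals (hypothetical labels)
df['price_bin'] = pd.qcut(df['item_price'], q=3, labels=['low', 'mid', 'high'])

# Average order quantity within each price interval
binned = df.groupby('price_bin', observed=True)['ord_qty'].mean()
print(binned)
```

On real data the number of bins is a judgment call: too few hides the shape of the curve, too many brings the noise back.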


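The charts show a U-shaped relationship between product price and average order quantity: demand is higher at low and at high prices and lower in the middle of the price range. The original explanation is consumer perception: a very low price can signal poor quality, while a very high price can seem not worth paying. That reasoning would predict lower demand at the extremes, however, so the U shape may instead reflect differences in product mix or a few large orders at particular prices, and it is worth verifying before drawing pricing conclusions. In any case, a sound pricing strategy can raise sales to some extent.

A regression model (for example linear or polynomial regression) can also be used to model the relationship between price and demand and so quantify the effect of price.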

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Read the data
df = pd.read_csv('order_train1.csv')

# Scatter plot of order quantity against price
plt.figure()
sns.scatterplot(x='item_price', y='ord_qty', data=df)

# Box plot on a separate figure (one box per distinct price, so it can get crowded)
plt.figure()
sns.boxplot(x='item_price', y='ord_qty', data=df)
plt.show()

# Fit a linear regression model
x = df[['item_price']]
y = df[['ord_qty']]
model = LinearRegression()
model.fit(x, y)
# Print the model coefficient and intercept
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
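
A straight line cannot capture a U-shaped relationship, which is where the polynomial regression mentioned above comes in. The sketch below fits a quadratic with NumPy's `np.polyfit` rather than scikit-learn, purely to keep it short; the price and quantity arrays are synthetic and chosen to follow an exact parabola, so they say nothing about the real data.

```python
import numpy as np

# Synthetic prices and quantities following a U shape (illustrative only)
price = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
qty = 0.02 * (price - 50) ** 2 + 5

# Least-squares fit of qty ~ a * price**2 + b * price + c
a, b, c = np.polyfit(price, qty, deg=2)

# When a > 0 the fitted curve has its minimum at price = -b / (2a)
turning_point = -b / (2 * a)
print(f'a={a:.4f}, b={b:.4f}, c={c:.4f}, turning point={turning_point:.1f}')
```

A positive quadratic coefficient `a` is consistent with a U shape; a coefficient near zero suggests the linear model is adequate.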

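(2) The impact of the sales region on demand, and the characteristics of demand in different regions

We can visualize demand by region, for example with histograms and box plots, to inspect its distribution. A one-way analysis of variance (ANOVA) can then test whether demand differs significantly between regions, which indicates whether the sales region affects demand.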

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Read the data
df = pd.read_csv('order_train1.csv')

# Histogram of order quantity, colored by region
plt.figure()
sns.histplot(x='ord_qty', hue='sales_region_code', data=df, kde=True)

# Box plot of order quantity per region, on a separate figure
plt.figure()
sns.boxplot(x='sales_region_code', y='ord_qty', data=df)

# One-way ANOVA across regions
grouped_data = df.groupby('sales_region_code')['ord_qty'].apply(list)
# ... (omitted in the original; see the complete code)
print('F-value:', f_value)
print('P-value:', p_value)
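
The listing above leaves out the step that produces `f_value` and `p_value`. One plausible way to fill it, assuming SciPy's `f_oneway` is called with one sample per region, is sketched below on synthetic data; the region codes and quantities are made up for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for the training data (illustrative values only)
df = pd.DataFrame({
    'sales_region_code': [101, 101, 101, 102, 102, 102, 103, 103, 103],
    'ord_qty':           [10, 12, 11, 20, 22, 21, 30, 29, 31],
})

# One list of order quantities per region
grouped_data = df.groupby('sales_region_code')['ord_qty'].apply(list)

# One-way ANOVA: the null hypothesis is that all regions share the same mean
f_value, p_value = f_oneway(*grouped_data)
print('F-value:', f_value)
print('P-value:', p_value)
```

A small p-value indicates that at least one region's mean demand differs from the others; it does not say which one.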


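(3) The characteristics of demand under different sales channels (online and offline)

Histograms and box plots of demand for each sales channel show how demand is distributed and how the channels differ. A t-test or a similar method can then check whether demand differs significantly between channels.

We then split the data by sales channel name (sales_chan_name) into online and offline subsets and compute summary statistics of order quantity (ord_qty) for each, including the mean, median, maximum, minimum, and standard deviation, to understand their distributions and differences.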

import pandas as pd

# Read the data
data = pd.read_csv('order_train1.csv')

# Inspect the first rows
print(data.head())

# Split the data into online and offline subsets by sales channel name
online_data = data[data['sales_chan_name'] == 'online']
offline_data = data[data['sales_chan_name'] == 'offline']

# Summary statistics of order quantity for each channel
print('Summary statistics of online order quantity:')
print(online_data['ord_qty'].describe())

print('Summary statistics of offline order quantity:')
print(offline_data['ord_qty'].describe())
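
The t-test mentioned above is not shown in the original. A minimal sketch, assuming Welch's variant (which does not require equal variances) and using synthetic quantities in place of the real file, could look like this:

```python
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for the training data (illustrative values only)
data = pd.DataFrame({
    'sales_chan_name': ['online'] * 5 + ['offline'] * 5,
    'ord_qty': [5, 6, 7, 6, 5, 20, 22, 19, 21, 23],
})

online = data.loc[data['sales_chan_name'] == 'online', 'ord_qty']
offline = data.loc[data['sales_chan_name'] == 'offline', 'ord_qty']

# Welch's t-test: compares the two means without assuming equal variances
t_stat, p_value = ttest_ind(online, offline, equal_var=False)
print('t-statistic:', t_stat)
print('P-value:', p_value)
```

Order quantities are often strongly skewed, so a rank-based test such as `scipy.stats.mannwhitneyu` may be a safer choice on the real data.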


Besides summary statistics, visualization gives a more intuitive view of demand under each sales channel. In Python, we can use Matplotlib or Seaborn for this.

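import seaborn as sns
import matplotlib.pyplot as plt

# Set the plot style
sns.set(style="ticks", palette="pastel")

# Box plot of order quantity for the online and offline channels
sns.boxplot(x="sales_chan_name", y="ord_qty", data=data)

# Remove the top and right spines, then show the figure
sns.despine(trim=True)
plt.show()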


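Running the code above produces a box plot of order quantity for the online and offline channels. Comparing the position, size, and shape of the boxes shows how the distribution of demand differs between channels. For example, if the online median is clearly higher than the offline median, online orders tend to be larger than offline ones.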

import matplotlib.pyplot as plt

# Extract online and offline order quantities
online_ord_qty = data[data["sales_chan_name"] == "online"]["ord_qty"]
offline_ord_qty = data[data["sales_chan_name"] == "offline"]["ord_qty"]

# Plot histograms of online and offline order quantity
# ... (omitted in the original; see the complete code)
labels = ['Online', 'Offline']

plt.bar(labels, X)
plt.title('Distribution of Sales Channels')
plt.xlabel('Sales Channels')
plt.ylabel('Sales Volume')
plt.show()
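
The bar chart above plots `X`, which the listing never defines. Given the y-axis label "Sales Volume", one plausible reading (an assumption, not confirmed by the original) is the total order quantity per channel. A minimal sketch on synthetic data:

```python
import pandas as pd

# Synthetic stand-in for the training data (illustrative values only)
data = pd.DataFrame({
    'sales_chan_name': ['online', 'offline', 'offline', 'online', 'offline'],
    'ord_qty': [4, 10, 6, 2, 8],
})

# Total order quantity per channel, listed in a fixed label order
totals = data.groupby('sales_chan_name')['ord_qty'].sum()
labels = ['online', 'offline']
X = [int(totals[name]) for name in labels]
print(dict(zip(labels, X)))
```

Building `X` from an explicit label list keeps the bar heights aligned with the axis labels regardless of how the group keys are sorted.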


A kernel density plot shows the distribution more intuitively: it smooths the data into a continuous curve that reflects the probability density.

import seaborn as sns
import matplotlib.pyplot as plt

# Extract online and offline order quantities
online_ord_qty = data[data["sales_chan_name"] == "online"]["ord_qty"]
offline_ord_qty = data[data["sales_chan_name"] == "offline"]["ord_qty"]

# Kernel density plots of online and offline order quantity
# (fill=True replaces the deprecated shade=True)
sns.kdeplot(online_ord_qty, fill=True, label="Online")
sns.kdeplot(offline_ord_qty, fill=True, label="Offline")
plt.legend(loc="upper right")
plt.title("Distribution of Order Quantity by Sales Channel")
plt.xlabel("Order Quantity")
plt.ylabel("Density")
plt.show()

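The kernel density plot shows that demand under the offline channel is more concentrated than under the online channel, with a clear peak, while the online distribution is flatter with no obvious peak. Offline demand is also higher overall, and online demand lower.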


# Scatter plot of order quantity against price, colored by sales channel
sns.scatterplot(data=data, x="item_price", y="ord_qty", hue="sales_chan_name")


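The scatter plot suggests that price and demand are more closely related for offline sales than for online sales, and that the offline channel has some outliers with both high price and high demand. Note that because price and quantity both take discrete values, points in the scatter plot overlap.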

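(4) Differences and similarities in demand between product categories

  1. Group by category and compute the mean, median, standard deviation, and other statistics of order quantity for each category;
  2. Plot a histogram of order quantity for each category;
  3. Compare demand across categories to find their differences and similarities.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('order_train1.csv')

# Group by category and compute the mean, median, and standard deviation of order quantity
category_demand = data.groupby('second_cate_code')['ord_qty'].agg(['mean', 'median', 'std'])
# ... (omitted in the original; see the complete code)
# Plot a histogram of order quantity for each category
category_list = data['second_cate_code'].unique().tolist()
for category in category_list:
    demand = data.loc[data['second_cate_code'] == category, 'ord_qty']
    plt.hist(demand, bins=30)
    plt.title(f'Cate:{category}')
    plt.xlabel('Demand')
    plt.ylabel('Frequency')
    plt.show()

# Compare demand across categories to find their differences and similarities
# Statistical methods such as the t-test or ANOVA can be used here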
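
The comparison step is only described in comments above. As a hedged sketch of what it might involve, the code below adds the coefficient of variation to the per-category statistics and runs a Kruskal-Wallis test, a rank-based alternative to ANOVA chosen here because order quantities tend to be skewed. The category codes and quantities are synthetic.

```python
import pandas as pd
from scipy.stats import kruskal

# Synthetic stand-in for the training data (illustrative values only)
data = pd.DataFrame({
    'second_cate_code': [401] * 4 + [402] * 4 + [403] * 4,
    'ord_qty': [3, 4, 5, 4, 30, 35, 40, 33, 4, 5, 3, 6],
})

# Per-category statistics plus the coefficient of variation (std / mean)
category_stats = data.groupby('second_cate_code')['ord_qty'].agg(['mean', 'median', 'std'])
category_stats['cv'] = category_stats['std'] / category_stats['mean']
print(category_stats)

# Kruskal-Wallis test: do the categories share the same distribution?
samples = [group.to_numpy() for _, group in data.groupby('second_cate_code')['ord_qty']]
h_stat, p_value = kruskal(*samples)
print('H-statistic:', h_stat)
print('P-value:', p_value)
```

Categories with similar means but very different coefficients of variation differ in stability rather than in level, which matters when choosing a forecasting model per category.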


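(5) The characteristics of demand in different periods (for example the beginning, middle, and end of the month)

  1. Group order dates by month and compute the mean, median, standard deviation, and other statistics of order quantity for each month;
  2. Plot the trend of order quantity by month;
  3. Within each month, group orders by day and compute the same statistics for the beginning, middle, and end of the month;
  4. Compare demand across periods to find their differences and similarities.

To study demand in different periods, we first split order dates into the beginning, middle, and end of the month. The dt accessor in pandas exposes the year, month, day, hour, and so on of a datetime column. Here we use the pandas cut function to bin the day of the month and then total the order quantity in each bin.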

import pandas as pd

# Read the data
data = pd.read_csv('order_train1.csv')

# Convert the order date to datetime (assumes dates such as 2015/9/1, with a four-digit year)
data['order_date'] = pd.to_datetime(data['order_date'], format='%Y/%m/%d')

# Sort the data by order date
data = data.sort_values(by='order_date')

# Bin orders into the beginning, middle, and end of the month
# ... (omitted in the original; see the complete code)
time_bins = [0, 10, 20, 31]
data['order_date_category'] = pd.cut(data['order_date'].dt.day, bins=time_bins, labels=time_labels)

# Total order quantity in each period
demand_by_time = data.groupby('order_date_category')['ord_qty'].sum()

# Bar chart of order quantity by period
demand_by_time.plot(kind='bar')
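
The listing above uses `time_labels` without defining it. The sketch below shows how `pd.cut` assigns days to the three bins, with hypothetical label names and a handful of synthetic orders.

```python
import pandas as pd

# Synthetic orders (illustrative values only)
data = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-01-05', '2018-01-10', '2018-01-15',
                                  '2018-01-25', '2018-01-31']),
    'ord_qty': [10, 20, 30, 40, 50],
})

time_bins = [0, 10, 20, 31]
time_labels = ['early', 'mid', 'late']  # hypothetical label names

# Bins are right-closed: days 1-10 -> early, 11-20 -> mid, 21-31 -> late
data['order_date_category'] = pd.cut(data['order_date'].dt.day,
                                     bins=time_bins, labels=time_labels)

demand_by_time = data.groupby('order_date_category', observed=False)['ord_qty'].sum()
print(demand_by_time)
```

The last bin covers up to 11 days while the others cover 10, so comparing average demand per day is fairer than comparing raw totals.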


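(6) The impact of holidays on demand

Holidays usually affect purchasing behavior and therefore product demand. For this question, we take the statutory public holidays in China and compare holidays with non-holidays.

To analyze the effect, first identify all holidays and their dates. In this dataset, holiday dates such as the Spring Festival and National Day can be identified from the order date (order_date) column. Then compute the average demand on holidays and compare it with the demand on ordinary days.

  1. Load the dataset and preprocess it: convert the order date (order_date) to a date type, then mark each date as 1 if it is a holiday and 0 otherwise.
  2. Split the dataset by that flag into holiday data and non-holiday data.
  3. For both subsets, compute the average demand per day.
  4. Visualize the result and compare average demand on holidays and non-holidays to see whether there is a clear difference.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import holidays

# Load the dataset and preprocess it
df = pd.read_csv('data/order_train0.csv')
df['order_date'] = pd.to_datetime(df['order_date'])
cn_holidays = holidays.China(years=[2015, 2016, 2017, 2018])
# Flag a row as 1 when its calendar date is in the holiday calendar, else 0
df['is_holiday'] = df['order_date'].dt.date.map(lambda d: d in cn_holidays).astype(int)

# Split the dataset into holiday data and non-holiday data
# ... (omitted in the original; see the complete code)

# Average order quantity per day
holiday_demand = holiday_df.groupby(['order_date'])['ord_qty'].mean()
non_holiday_demand = non_holiday_df.groupby(['order_date'])['ord_qty'].mean()

# Compare average demand on holidays and non-holidays
plt.figure(figsize=(10,6))
plt.plot(holiday_demand.index, holiday_demand.values, label='Holiday')
plt.plot(non_holiday_demand.index, non_holiday_demand.values, label='Non-Holiday')
plt.title('Average demand on holiday vs non-holiday')
plt.xlabel('Date')
plt.ylabel('Average demand')
plt.legend()
plt.show()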
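
The split into `holiday_df` and `non_holiday_df` is not shown above. The sketch below illustrates the flag-and-split steps on synthetic orders. To stay self-contained it uses a short hand-written list of holiday dates instead of the `holidays` package; that list is an illustrative assumption, and a real one would need to cover the whole period.

```python
import pandas as pd

# Synthetic orders around National Day 2018 (illustrative values only)
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-09-29', '2018-09-30', '2018-10-01',
                                  '2018-10-01', '2018-10-02', '2018-10-08']),
    'ord_qty': [12, 9, 3, 5, 4, 15],
})

# Hypothetical hand-maintained holiday list
holiday_dates = pd.to_datetime(['2018-10-01', '2018-10-02'])

# Flag holidays, then split on the flag
df['is_holiday'] = df['order_date'].isin(holiday_dates).astype(int)
holiday_df = df[df['is_holiday'] == 1]
non_holiday_df = df[df['is_holiday'] == 0]

# Total demand per day, then the average of those daily totals
holiday_daily = holiday_df.groupby('order_date')['ord_qty'].sum()
non_holiday_daily = non_holiday_df.groupby('order_date')['ord_qty'].sum()
print('Holiday:', holiday_daily.mean())
print('Non-holiday:', non_holiday_daily.mean())
```

This variant averages daily totals rather than per-order quantities; the two answer different questions (how much is ordered per day versus how large a typical order is), so it is worth stating which one a chart shows.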


(7) The impact of promotions on demand

Promotions usually increase sales and therefore affect product demand. For this question, we select some promotion dates and compare promotion periods with non-promotion periods.

  1. For promotion-day data and non-promotion-day data, compute the average demand for each day.
  2. Visualize the results and compare average demand on promotion days and non-promotion days to see whether there is a significant difference.
  3. Compare the average order quantity during promotion and non-promotion periods to assess the impact of promotions on demand.

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset; parse_dates converts order_date to datetime
df = pd.read_csv('data/order_train0.csv', parse_dates=['order_date'])
df['order_date'] = pd.to_datetime(df['order_date'])

# Split the dataset by promotion date
promo_dates = [pd.to_datetime('2016-06-18'), pd.to_datetime('2016-11-11')]
df_promo = df[df['order_date'].isin(promo_dates)]
df_nonpromo = df[~df['order_date'].isin(promo_dates)]

# Average order quantity per day for promotion and non-promotion periods
# ... (omitted in the original; see the complete code)

# Visualize the result
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(promo_mean_qty.index, promo_mean_qty.values, label='Promo')
ax.plot(nonpromo_mean_qty.index, nonpromo_mean_qty.values, label='Non-Promo')
ax.set_xlabel('Date')
ax.set_ylabel('Average Demand')
ax.set_title('Impact of Promotions on Product Demand')
ax.legend()
plt.show()
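
The listing above plots `promo_mean_qty` and `nonpromo_mean_qty` without showing how they are computed. Assuming they are per-day averages of order quantity, as the preceding comment suggests, a minimal sketch on synthetic data is:

```python
import pandas as pd

# Synthetic orders around two promotion dates (illustrative values only)
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2016-06-17', '2016-06-18', '2016-06-18',
                                  '2016-06-19', '2016-11-11', '2016-11-12']),
    'ord_qty': [10, 40, 60, 12, 80, 8],
})
promo_dates = pd.to_datetime(['2016-06-18', '2016-11-11'])

# Average order quantity per day, separately for promotion and other days
is_promo = df['order_date'].isin(promo_dates)
promo_mean_qty = df[is_promo].groupby('order_date')['ord_qty'].mean()
nonpromo_mean_qty = df[~is_promo].groupby('order_date')['ord_qty'].mean()

# Ratio of the two averages as a simple measure of promotional lift
lift = promo_mean_qty.mean() / nonpromo_mean_qty.mean()
print('Lift:', lift)
```

With only one point per promotion date, the promotion series is very short, so markers or a bar chart may read better than a connected line.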


Compare the average order demand during the promotion period and the non-promotion period to analyze the impact of the promotion on the product demand.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Define the promotion dates (converted to datetime so they match order_date)
promotions = pd.to_datetime(['2015/6/18', '2015/11/11', '2016/6/18', '2016/11/11', '2017/6/18', '2017/11/11', '2018/6/18'])

# 2. Load and preprocess the data
df = pd.read_csv('data/order_train0.csv', parse_dates=['order_date'], dtype={'sales_region_code': 'str'})
df['is_promotion'] = df['order_date'].isin(promotions).astype(int)
df_agg = df.groupby(['order_date'])['ord_qty'].sum().reset_index()

# 3. Average daily order quantity on promotion and non-promotion days
df_promo = df_agg[df_agg['order_date'].isin(promotions)]
df_nonpromo = df_agg[~df_agg['order_date'].isin(promotions)]
promo_mean = df_promo['ord_qty'].mean()
nonpromo_mean = df_nonpromo['ord_qty'].mean()

# 4. Compare promotion and non-promotion order quantity in a bar chart
# ... (omitted in the original; see the complete code)
ax.bar(['Promotion', 'Non-promotion'], [promo_mean, nonpromo_mean])
ax.set_xlabel('Period')
ax.set_ylabel('Average order quantity')
ax.set_title('Effect of promotions on order quantity')
plt.show()


The bar chart shows that the average daily order quantity on promotion days is higher than on non-promotion days. This suggests that promotions have a positive effect on product demand.

(8) The impact of seasonal factors on demand

  1. Convert order dates to seasons and aggregate order demand by season.
  2. For each season, plot a histogram and kernel density plot of order demand and a scatter plot of order demand against product price.

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
df = pd.read_csv('order_train1.csv')

# Map an order date string (year/month/day) to a season
def date_to_season(date):
    year, month, day = map(int, date.split('/'))
    if month in (3, 4, 5):
        return 'Spring'
    elif month in (6, 7, 8):
        return 'Summer'
    elif month in (9, 10, 11):
        return 'Autumn'
    else:
        return 'Winter'

df['Season'] = df['order_date'].apply(date_to_season)

# Aggregate order demand by season
# ... (omitted in the original; see the complete code)

# Histogram and kernel density plot for each season
for season in ['Spring', 'Summer', 'Autumn', 'Winter']:
    plt.figure(figsize=(8,6))
    plt.hist(df[df['Season'] == season]['ord_qty'], bins=20, alpha=0.5, color='blue')
    df[df['Season'] == season]['ord_qty'].plot(kind='density', secondary_y=True)
    plt.title('Demand Distribution in ' + season)
    plt.xlabel('Order Demand')
    plt.ylabel('Frequency / Density')
    plt.show()

# Scatter plot of demand against price for each season
for season in ['Spring', 'Summer', 'Autumn', 'Winter']:
    plt.figure(figsize=(8,6))
    plt.scatter(df[df['Season'] == season]['item_price'], df[df['Season'] == season]['ord_qty'], alpha=0.5)
    plt.title('Demand vs. Price in ' + season)
    plt.xlabel('Item Price')
    plt.ylabel('Order Demand')
    plt.show()
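
The aggregation step is omitted from the listing above. One way it could be done, sketched here on synthetic data, is to map the month to a season with a dictionary and then group by season. The month-to-season mapping mirrors `date_to_season`; the order values are made up.

```python
import pandas as pd

# Synthetic orders (illustrative values only)
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-01-15', '2018-04-02', '2018-07-20',
                                  '2018-10-09', '2018-12-01']),
    'ord_qty': [50, 20, 10, 25, 45],
})

# Same month-to-season rule as date_to_season, expressed as a lookup table
month_to_season = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}
df['Season'] = df['order_date'].dt.month.map(month_to_season)

# Total and average order quantity per season
season_demand = df.groupby('Season')['ord_qty'].agg(['sum', 'mean'])
print(season_demand)
```

Because December is grouped with the following January and February, a "Winter" total mixes months from two calendar years unless the year is adjusted as well.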


The results show that the distribution of order demand differs between seasons: demand is generally higher in winter and lower in summer. The relationship between order demand and product price also varies by season. In spring and autumn there is a certain positive correlation between demand and price, while in summer and winter there is no obvious correlation.

2.2 Question 2

[The 11th Teddy Cup Data Mining Challenge in 2023] Question B: Data Analysis and Demand Forecasting Modeling of Product Orders and Detailed Explanation of Python Code Question 2

3 Complete code

Open the following address in a desktop browser:

betterbench.top/#/49/detail




Author:kimi

link:http://www.pythonblackhole.com/blog/article/25305/5cbee5c4d03d28cbe4c4/

source:python black hole net

Please indicate the source for any form of reprinting. If any infringement is discovered, it will be held legally responsible.
