2023 National Competition Question C: Complete Analysis of Key Points for Review of Automatic Pricing and Replenishment Decisions of Vegetable Commodities

All analyses in this article represent only my personal opinions, not official ones, and are for reference only.
CSDN: Chuan Chuan Rookie
Public Account: Chuan Chuan takes you to learn AI

1. Question 1

Insert image description here

1.1 First point

Deal with abnormal data and situations such as discounts, returns, and no-sale items.

First, data anomalies. I used box plots to detect outliers in the sales data, roughly as follows:
Insert image description here
What to do after detection? For the detected outliers, we can use the quartile (IQR) method and replace them with the median instead of deleting them outright.
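A minimal sketch of this step, assuming the usual box-plot rule (values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR count as outliers). The function name and the toy numbers are illustrative only, not taken from the competition data:

```python
import pandas as pd

def replace_outliers_with_median(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (values < lower) | (values > upper)
    # Series.mask replaces the entries where the condition is True
    return values.mask(is_outlier, values.median())

# Toy sales volumes in kg (made-up numbers); 25.0 is the obvious outlier
sales = pd.Series([0.4, 0.5, 0.6, 0.5, 0.7, 0.6, 25.0])
cleaned = replace_outliers_with_median(sales)
print(cleaned.tolist())
```

Replacing rather than dropping keeps the number of records per day unchanged, which matters if the series is later aggregated by date.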

Discount sales data distribution:
Insert image description here
How to deal with it? I did not treat these records as anomalies. I consider this real data, so I only encoded "yes" and "no" as 0/1 values, a simple mapping.

Returns are recorded as negative sales volumes, so I handled them by accumulation. For example, if I sell 10 potatoes and the customer then returns two (-2), the real sales volume is 10 + (-2) = 8, so I think summing is a feasible approach.
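The two treatments above (mapping the discount flag and netting returns against sales) can be sketched as follows. The column names, the "是"/"否" flag values and the choice of 1 for "yes" are assumptions for illustration, not the actual attachment headers:

```python
import pandas as pd

# Toy transactions; column names are placeholders, not the real attachment headers
records = pd.DataFrame({
    "date": ["2023-06-01", "2023-06-01", "2023-06-02"],
    "item": ["potato", "potato", "potato"],
    "quantity_kg": [10.0, -2.0, 5.0],      # the negative row is a return
    "discounted": ["否", "否", "是"],       # assumed yes/no values: 是 = yes, 否 = no
})

# Map the yes/no discount flag to 1/0
records["discount_flag"] = records["discounted"].map({"是": 1, "否": 0})

# Summing signed quantities per item and day nets returns against sales
net_sales = records.groupby(["date", "item"], as_index=False)["quantity_kg"].sum()
print(net_sales)
```

On the first day the net volume is 10 + (-2) = 8 kg, matching the potato example.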

1.2 Second point

When studying the distribution and change patterns of categories and single products, time effects should be considered.

Interpretation: consider time effects such as seasonality and holiday effects more explicitly when studying the distribution and change patterns of categories and single products. I think covering one of these is enough; most people would pick seasonality and analyze that.

Data conversion:

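import pandas as pd

# data1 and data2 are the DataFrames loaded earlier (data2 holds the sales records used below)
merged_data = pd.merge(data2, data1, on='单品编码')

# Convert the sales date to datetime and derive month, season and weekday columns
merged_data['销售日期'] = pd.to_datetime(merged_data['销售日期'])
merged_data['月份'] = merged_data['销售日期'].dt.month
# Season code: 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov
merged_data['季节'] = merged_data['销售日期'].dt.month % 12 // 3 + 1
# Day of week: 0 = Monday, ..., 6 = Sunday
merged_data['周内天数'] = merged_data['销售日期'].dt.weekday

merged_data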

Get the season, month, and day of the week as follows:
Insert image description here

import matplotlib.pyplot as plt
import seaborn as sns

# Seasonal effect: mean sales volume per category and season
plt.figure(figsize=(10, 4), dpi=150)
seasonal_effect = merged_data.groupby(['分类名称', '季节'])['销量(千克)'].mean().reset_index()
sns.barplot(x='季节', y='销量(千克)', hue='分类名称', data=seasonal_effect)
plt.title("Seasonal effect")
plt.show()

# Monthly effect: mean sales volume per category and month
plt.figure(figsize=(10, 4), dpi=150)
monthly_effect = merged_data.groupby(['分类名称', '月份'])['销量(千克)'].mean().reset_index()
sns.barplot(x='月份', y='销量(千克)', hue='分类名称', data=monthly_effect)
plt.title("Monthly effect")
plt.show()

# Day-of-week effect: mean sales volume per category and weekday
plt.figure(figsize=(10, 4), dpi=150)
weekday_effect = merged_data.groupby(['分类名称', '周内天数'])['销量(千克)'].mean().reset_index()
sns.barplot(x='周内天数', y='销量(千克)', hue='分类名称', data=weekday_effect)
plt.title("Day-of-week effect")
plt.show()

The results are as follows:
Insert image description here
Insert image description here
Insert image description here

1.3 Third point

In the correlation analysis of category and single product sales, the conditions for using the correlation analysis method should be considered. Discussion of category or item distribution types should be encouraged.

First, the conditions for using a correlation method should be checked. To that end, I tested the data for normality:

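import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

# Visual check (histogram)
plt.figure(figsize=(12, 6), dpi=150)
sns.histplot(data2['销量(千克)'], bins=30, kde=True)
plt.title('Histogram of sales volume')
plt.show()

# Visual check (probability plot)
plt.figure(figsize=(12, 6), dpi=150)
stats.probplot(data2['销量(千克)'], plot=plt)
plt.title('Probability plot of sales volume')
plt.show()

# Skewness
skewness = data2['销量(千克)'].skew()
print(f"Skewness of the data: {skewness}")

# Normality test (Shapiro-Wilk); the result guides the choice of correlation method later.
# Note: SciPy warns that the p-value may be inaccurate for samples larger than 5000.
_, p_value = stats.shapiro(data2['销量(千克)'])
print(f"Shapiro-Wilk p-value: {p_value}")

# Decide on normality from the p-value
alpha = 0.05
if p_value > alpha:
    print("The data look normally distributed (fail to reject H0)")
else:
    print("The data are not normally distributed (reject H0)")

The result is as follows:
Insert image description here

Skewness of the data: 0.6795134003514179
Shapiro-Wilk p-value: 0.0
The data are not normally distributed (reject H0)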

The data clearly do not meet Pearson's normality assumption, so Spearman is used for the correlation analysis and its visualization.

As a rule of thumb: use the Pearson correlation coefficient for normally distributed data, and the Spearman or Kendall rank correlation coefficient for non-normally distributed or ordinal data.
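As a small illustration of the rank-based option, the sketch below computes a Spearman correlation matrix with pandas. The category names and daily volumes are made up; in practice the input would be one column of daily sales per category:

```python
import pandas as pd

# Toy daily sales (kg) per category; names and numbers are made up
daily = pd.DataFrame({
    "leafy":    [50, 62, 58, 71, 90, 85],
    "pepper":   [20, 24, 23, 30, 41, 39],
    "mushroom": [33, 30, 31, 25, 18, 20],
})

# Spearman works on ranks, so it does not require normally distributed data
spearman_corr = daily.corr(method="spearman")
print(spearman_corr.round(2))
```

Passing `method="kendall"` to the same call gives the Kendall rank correlation instead. The resulting matrix can be fed directly to `sns.heatmap` for the visual version.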

1.4 Fourth point

Simple visualizations and statistical descriptions are not enough.

First, statistical description and some visualization are necessary and cannot be skipped. The existing data support many visualizations, and each one yields some analytical conclusion. I won't show them all here.

For example, a bar chart of the distribution:
Insert image description here
a heat map:
Insert image description here
and a pie chart:
Insert image description here
I think each of these needs some accompanying analysis rather than just being displayed; every figure should carry a specific meaning.

2. Question 2

Insert image description here

2.1 First point

Analyze, both qualitatively and quantitatively, whether the sales volume, replenishment volume and price of each category are correlated. If they are, a quantitative relationship model should be established.

Here's what I did: I used an ARIMA model and a genetic algorithm to predict sales and develop a replenishment and pricing strategy.

Qualitative analysis: Analyze the impact of replenishment volume and price on sales volume through business logic and market research. For example, high prices may reduce sales volume, but they may also increase total revenue.

Quantitative analysis: Use statistical tests (Spearman's correlation) to quantify the relationship between replenishment volume and price and sales volume.
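A minimal sketch of the quantitative step, using made-up price and volume numbers for one category. `scipy.stats.spearmanr` measures the monotonic association, and a log-log least-squares fit gives a simple constant-elasticity relationship model. The model form is an illustrative assumption, not necessarily the exact model from the solution:

```python
import numpy as np
from scipy import stats

# Toy daily observations for one category (made-up numbers)
price = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0])            # yuan/kg
volume = np.array([120.0, 110.0, 98.0, 92.0, 80.0, 77.0, 70.0, 58.0])  # kg

# Rank correlation: sign and strength of the monotonic relationship
rho, p_value = stats.spearmanr(price, volume)

# Log-log linear fit: ln(volume) = intercept + elasticity * ln(price)
elasticity, intercept = np.polyfit(np.log(price), np.log(volume), deg=1)

def predict_volume(p: float) -> float:
    """Predicted sales volume (kg) at price p under the fitted power law."""
    return float(np.exp(intercept) * p ** elasticity)

print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
print(f"Estimated price elasticity = {elasticity:.2f}")
print(f"Predicted volume at 5.8 yuan/kg = {predict_volume(5.8):.1f} kg")
```

A fitted function like `predict_volume` is what the later optimization step needs: it turns a candidate price into an expected sales volume.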

2.2 Second point

It is a good way to consider the interdependence between replenishment quantity and price from a mechanism perspective.

The replenishment quantity and price do have an interdependent relationship in real business, and the normal relationship should be as follows:

  1. High prices may lead to lower demand, so replenishment volumes should be reduced accordingly.
  2. Low replenishment levels can lead to stockouts, prompting higher prices to balance demand.
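The mechanism in the two points above can be written down as a toy model. Everything here is a hypothetical assumption for illustration: a constant-elasticity demand curve, a fixed loss (spoilage) rate, and all parameter values:

```python
def expected_demand(price: float, base_demand: float = 100.0,
                    base_price: float = 5.0, elasticity: float = -1.2) -> float:
    """Constant-elasticity demand curve (hypothetical parameters)."""
    return base_demand * (price / base_price) ** elasticity

def replenishment_for(price: float, loss_rate: float = 0.10) -> float:
    """Stock to order so that, after spoilage, supply just covers demand."""
    return expected_demand(price) / (1.0 - loss_rate)

def profit(price: float, replenishment: float, unit_cost: float = 3.0,
           loss_rate: float = 0.10) -> float:
    """Revenue on the units actually sold minus the cost of everything ordered."""
    sellable = replenishment * (1.0 - loss_rate)
    sold = min(sellable, expected_demand(price))
    return price * sold - unit_cost * replenishment

for p in (4.0, 5.0, 6.0):
    q = replenishment_for(p)
    print(f"price={p:.1f}  replenishment={q:.1f}  profit={profit(p, q):.1f}")
```

Raising the price lowers `expected_demand` and therefore the replenishment that is worth ordering, while ordering too little caps the units sold, which is the interdependence described above.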

You can also visualize the relationship between these variables:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# grouped_data is assumed to already contain the per-sale volume, unit cost and recommended price columns
grouped_data = merged_data
# Create a subset of the data for visualization
subset_data = grouped_data[['销售日期', '单品名称', '单次销量(千克)', '单位总成本', '推荐售价']].copy()
subset_data['销售日期'] = pd.to_datetime(subset_data['销售日期'])
subset_data.set_index('销售日期', inplace=True)

# 1. Scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=subset_data, x='单位总成本', y='单次销量(千克)', hue='单品名称')
plt.title('Scatter Plot: Unit Cost vs Sales Volume')
plt.xlabel('Unit Cost (per kg)')
plt.ylabel('Sales Volume (kg)')
plt.legend(title='Product Name', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# 2. Heat map (numeric_only=True skips the item-name column, which newer pandas would reject)
correlation_matrix = subset_data.corr(numeric_only=True)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Heatmap: Correlation between Variables')
plt.show()

# 3. Parallel coordinates plot for the five most frequently sold items
top_5_products = subset_data['单品名称'].value_counts().index[:5]
subset_top_5 = subset_data[subset_data['单品名称'].isin(top_5_products)]

plt.figure(figsize=(12, 6))
parallel_coordinates(subset_top_5, '单品名称', cols=['单位总成本', '推荐售价', '单次销量(千克)'])
plt.title('Parallel Coordinates Plot: Multi-Dimensional Analysis')
plt.legend(title='Product Name', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

1) Scatter plot
Insert image description here
2) Heat map
Insert image description here
3) Parallel coordinate plot
Insert image description here

2.3 Third point

Time factors of data should be considered, such as seasonality, periodicity, holidays, trends, etc.

This is similar to Question 1: the time factor still needs to be considered, since the data are a time series. I performed the time series analysis with ARIMA (autoregressive integrated moving average) and LSTM (long short-term memory network), both of which capture seasonality, cyclicality and trends in the data. However, neither model handled the holiday factor directly (I forgot to analyze this).

Interested readers can try a more advanced time series model, such as Facebook's Prophet, to examine the impact of holidays themselves. Reference code:

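from prophet import Prophet  # the package was named fbprophet before version 1.0

# product_data: daily sales of one item; Prophet expects the columns 'ds' (date) and 'y' (value)
product_data = product_data.rename(columns={'销售日期': 'ds', '单次销量(千克)': 'y'})
model = Prophet(yearly_seasonality=True, weekly_seasonality=True, daily_seasonality=False)
model.add_country_holidays(country_name='CN')  # include built-in Chinese public holidays; must be called before fit
model.fit(product_data)
future_dates = model.make_future_dataframe(periods=365)  # forecast horizon: the next 365 days
forecast = model.predict(future_dates)

fig = model.plot(forecast)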

2.4 Fourth point

The specific results of replenishment volume and pricing for each category in the coming week should be given, and the difference between working days and weekends must be reflected.

Using the time series forecasting algorithm and the genetic algorithm, I finally produced the predicted replenishment volumes and pricing strategy. There is no single reference answer here; the result just needs to be reasonable.
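To make the optimization step concrete, here is a bare-bones genetic algorithm in NumPy that searches for a markup rate per category. It is a sketch under stated assumptions, not the exact procedure from the solution: the costs, forecast demands, loss rates, elasticity and the profit formula are all hypothetical, and the operators (truncation selection, uniform crossover, Gaussian mutation) are one simple choice among many:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-category inputs: unit cost, forecast demand at cost price, loss rate
cost = np.array([3.0, 5.0, 8.0])
base_demand = np.array([100.0, 60.0, 40.0])
loss = np.array([0.10, 0.08, 0.12])
ELASTICITY = -2.5           # assumed price elasticity, identical for all categories
LOW, HIGH = 0.05, 2.0       # search bounds for the markup rate

def total_profit(markup: np.ndarray) -> float:
    price = cost * (1.0 + markup)
    demand = base_demand * (1.0 + markup) ** ELASTICITY
    replenishment = demand / (1.0 - loss)        # order enough to cover spoilage
    return float(np.sum(price * demand - cost * replenishment))

def genetic_search(pop_size: int = 40, generations: int = 60) -> np.ndarray:
    half = pop_size // 2
    pop = rng.uniform(LOW, HIGH, size=(pop_size, cost.size))
    for _ in range(generations):
        fitness = np.array([total_profit(ind) for ind in pop])
        elite = pop[np.argsort(fitness)[-half:]]              # selection: keep the better half
        parents_a = elite[rng.integers(len(elite), size=half)]
        parents_b = elite[rng.integers(len(elite), size=half)]
        mask = rng.random(parents_a.shape) < 0.5              # uniform crossover
        children = np.where(mask, parents_a, parents_b)
        children = children + rng.normal(0.0, 0.05, children.shape)  # Gaussian mutation
        pop = np.vstack([elite, np.clip(children, LOW, HIGH)])
    fitness = np.array([total_profit(ind) for ind in pop])
    return pop[np.argmax(fitness)]

best_markup = genetic_search()
print("best markup per category:", best_markup.round(2))
print("total profit:", round(total_profit(best_markup), 1))
```

The price and replenishment volume for each category then follow from the best markup exactly as computed inside `total_profit`.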

For working days versus weekends, you can add an indicator variable such as "is weekend" to the time series model, or train and predict separately on working-day and weekend data, so as to obtain different replenishment quantities and pricing strategies. PS: I didn't do this. I did not expect that the forecast would span so many days and that weekends would also need to be considered.
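A minimal sketch of the weekend indicator, using two weeks of made-up daily sales. Here the indicator only selects between two historical averages, which is the simplest possible use; in a real model it would be passed in as an extra regressor:

```python
import pandas as pd

# Toy daily sales for one category, Sat 2023-06-17 to Fri 2023-06-30 (made-up numbers)
dates = pd.date_range("2023-06-17", periods=14, freq="D")
sales = pd.Series([80, 78, 50, 52, 49, 51, 55,
                   83, 79, 48, 53, 50, 52, 57], index=dates, dtype=float)

# Indicator variable: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
is_weekend = pd.Series(dates.dayofweek >= 5, index=dates)

# Separate baselines for the two day types
weekday_mean = sales[~is_weekend].mean()
weekend_mean = sales[is_weekend].mean()

# Forecast for the following week, starting Saturday 2023-07-01
future = pd.date_range("2023-07-01", periods=7, freq="D")
forecast = pd.Series(
    [weekend_mean if d.dayofweek >= 5 else weekday_mean for d in future],
    index=future,
)
print(forecast.round(1))
```

The same split can be applied to prices, so that the final table reports separate replenishment volumes and prices for working days and weekends.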

3. Question 3

3.1 First point

Determine the substitutable or complementary items in each category. For example, you can classify items in each category through correlation analysis.

My understanding is that the correlation between items can be analyzed with a heat map. Correlation analysis helps us understand the relationship between two variables: if the sales of two items are strongly positively correlated, they may be complementary; conversely, if they are strongly negatively correlated, they may be substitutable.
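The rule above can be turned into a simple labelling step. This is a sketch with made-up item names and sales, and the 0.7 threshold is a hypothetical cut-off chosen only for illustration:

```python
import pandas as pd

# Toy daily sales (kg) for four items; names and numbers are made up
daily_sales = pd.DataFrame({
    "item_a": [10, 12, 15, 11, 18, 20, 14],
    "item_b": [5, 6, 8, 6, 9, 11, 7],
    "item_c": [9, 8, 5, 9, 3, 2, 6],
    "item_d": [4, 7, 5, 6, 4, 7, 5],
})

THRESHOLD = 0.7   # hypothetical cut-off for a "strong" correlation
corr = daily_sales.corr(method="spearman")

pairs = []
items = list(corr.columns)
for i, a in enumerate(items):
    for b in items[i + 1:]:
        rho = corr.loc[a, b]
        if rho >= THRESHOLD:
            pairs.append((a, b, "complementary"))
        elif rho <= -THRESHOLD:
            pairs.append((a, b, "substitutable"))

for a, b, label in pairs:
    print(f"{a} - {b}: {label} (rho={corr.loc[a, b]:.2f})")
```

Correlation is only a heuristic here: two items can rise and fall together simply because they share the same season, so the labels are candidates to be checked against business sense rather than final conclusions.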

3.2 Second point

On the premise of considering the substitutability and complementarity of single products, a quantitative method for commodity variety diversity is given to meet the constraints of commodity demand and variety diversity.

Single product substitutability and complementarity: use correlation analysis and clustering, specifically Spearman correlation and K-means with the elbow method.

Quantification method of commodity variety diversity: Shannon diversity index can be considered.

import numpy as np

# Total forecast sales across all items
total_sales = forecast_results_new_method['预测销量_2023-07-01'].sum()

# Relative abundance of each item (its share of the total forecast sales)
forecast_results_new_method['relative_abundance'] = forecast_results_new_method['预测销量_2023-07-01'] / total_sales

# Shannon diversity index; items with zero share are skipped because log(0) is undefined
p = forecast_results_new_method['relative_abundance']
p = p[p > 0]
shannon_diversity_index = -np.sum(p * np.log(p))

shannon_diversity_index

The value is 3.83996693368468. I think the interpretation here is that the product assortment is quite diverse.
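On its own, the Shannon index is hard to judge because its maximum, ln(S), depends on the number of items S. One way to read it is through exp(H), the "effective number" of equally weighted items (exp(3.84) is about 46.5), or through Pielou's evenness H / ln(S). The sketch below uses toy quantities; the helper names are illustrative:

```python
import numpy as np

def shannon_index(quantities) -> float:
    """H = -sum(p_i * ln p_i) over items with positive quantity."""
    q = np.asarray(quantities, dtype=float)
    q = q[q > 0]
    p = q / q.sum()
    return float(-np.sum(p * np.log(p)))

def pielou_evenness(quantities) -> float:
    """H divided by its maximum ln(S), where S counts items with positive quantity."""
    s = int(np.sum(np.asarray(quantities, dtype=float) > 0))
    return shannon_index(quantities) / float(np.log(s)) if s > 1 else 0.0

even = [10, 10, 10, 10]     # four items with equal forecast sales
skewed = [37, 1, 1, 1]      # one item dominates

for name, q in (("even", even), ("skewed", skewed)):
    h = shannon_index(q)
    print(f"{name}: H={h:.3f}, exp(H)={np.exp(h):.2f}, evenness={pielou_evenness(q):.3f}")
```

An evenness close to 1 means sales are spread almost uniformly over the stocked items, which supports a "high diversity" reading better than the raw index does.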

3.3 Third point

The model and method of question 2 can be used for reference, but the difference between category and single product decision-making must be clarified.

The main differences are as follows:

  1. Decision-making level: Question 2 focuses on category-level decisions, while Question 3 requires decisions at the single-product level, which also involves inventory control and pricing strategies for each product.
  2. Constraints: Question 3 has more constraints, such as the number of single products being between 27-33 and a minimum display quantity for each single product.
  3. Objective function: Question 2 mainly maximizes the total revenue of the different categories, while Question 3 requires finding the optimal combination of single products to maximize revenue while satisfying the various constraints.
  4. Method application: Although similar prediction models and optimization algorithms (such as ARIMA and genetic algorithms) can be applied to both Question 2 and Question 3, Question 3 may need additional methods, such as correlation analysis, to determine which items are substitutable or complementary.

The main difference from Question 2 is that many constraints have been added. I still used heuristic and time series algorithms for the implementation. In my solution, I took these differences into account and built and solved the model around the characteristics of Question 3.
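The two constraints mentioned above can be sketched as a feasibility-preserving greedy selection. All inputs are random placeholders, the minimum display quantity of 2.5 kg is an assumed value, and the profit formula ignores substitution effects between items, so this is only a starting point before a heuristic search:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical candidate items: unit cost and price (yuan/kg), forecast demand (kg)
n_candidates = 49
cost = rng.uniform(2.0, 10.0, n_candidates)
price = cost * rng.uniform(0.9, 1.8, n_candidates)
demand = rng.uniform(0.5, 30.0, n_candidates)

MIN_ITEMS, MAX_ITEMS = 27, 33   # allowed number of items on sale
MIN_DISPLAY = 2.5               # kg; assumed value for the minimum display quantity

def select_items(price, cost, demand):
    """Greedy selection under the item-count and minimum-display constraints."""
    quantity = np.maximum(demand, MIN_DISPLAY)     # never stock less than the minimum
    profit = price * demand - cost * quantity      # sell the demand, pay for the stock
    order = np.argsort(profit)[::-1]               # most profitable first
    chosen = list(order[:MIN_ITEMS])               # at least MIN_ITEMS must be stocked
    for idx in order[MIN_ITEMS:MAX_ITEMS]:         # add more only while they pay off
        if profit[idx] > 0:
            chosen.append(idx)
    return np.array(chosen), quantity, profit

chosen, quantity, profit = select_items(price, cost, demand)
print(f"{len(chosen)} items selected, expected profit {profit[chosen].sum():.1f}")
```

In a genetic algorithm the same two rules would typically appear as a repair step or a penalty term, so that every candidate solution stays within 27-33 items and respects the minimum display quantity.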

3.4 Fourth point

Comparison of results under different models or optimization schemes is encouraged.

For model comparison, you can compare the results of different time series models and of different heuristic algorithms.
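One common way to compare forecasting models is to score them on the same held-out period with a few error metrics. The sketch below uses made-up numbers, and `model_a` and `model_b` are placeholders for whichever models are being compared (for example ARIMA versus LSTM):

```python
import numpy as np

def mae(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# Held-out week of actual sales and two sets of forecasts (all numbers made up)
actual = np.array([50.0, 52.0, 49.0, 80.0, 78.0, 51.0, 55.0])
forecasts = {
    "model_a": np.array([55.0, 55.0, 55.0, 70.0, 70.0, 55.0, 55.0]),
    "model_b": np.array([51.0, 51.0, 50.0, 78.0, 79.0, 52.0, 54.0]),
}

scores = {
    name: {"MAE": mae(actual, pred), "RMSE": rmse(actual, pred), "MAPE": mape(actual, pred)}
    for name, pred in forecasts.items()
}
for name, s in scores.items():
    print(name, {k: round(v, 2) for k, v in s.items()})

best = min(scores, key=lambda name: scores[name]["RMSE"])
print("lowest RMSE:", best)
```

For the heuristic algorithms, the analogous comparison is the best objective value reached and the run time under the same evaluation budget.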

4. Question 4

4.1 First point

Give suggestions for collecting new data, such as business data (daily replenishment volume, inventory table, daily loss rate, etc.), external data (weather data, etc.) and consumer data, and explain the reasons (how the new data can be used to improve the model).

Daily replenishment volume:

  1. Reason: Knowing daily replenishment levels helps estimate actual inventory levels and subsequent sales more accurately.
  2. How to use: Improve inventory management models and further refine demand forecasts.

Inventory table:

  1. Reason: Inventory data help identify which items are more likely to be out of stock or overstocked.
  2. How to use: Optimize replenishment strategies and reduce the risk of stockouts and slow-moving stock.

Daily loss rate:

  1. Reason: Daily loss data help predict more accurately the quantity of goods actually available for sale.
  2. How to use: Integrate into sales forecasting models to improve forecast accuracy.

Weather data:

  1. Reason: Weather conditions (such as temperature and precipitation) may affect the sales of certain vegetables.
  2. How to use: Add as a feature in the forecasting model to account for its potential impact on sales.

Holidays and events calendar:

  1. Reason: Special days (such as holidays and promotion days) often affect sales.
  2. How to use: Add to the models to account for the impact of these days on sales.
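As an illustration of how such external data would enter a model, the sketch below joins hypothetical weather and holiday tables onto daily sales to build a feature table. All column names and values are made up:

```python
import pandas as pd

days = pd.date_range("2023-06-21", periods=4, freq="D")

# Toy tables; column names and values are placeholders
sales = pd.DataFrame({"date": days, "sales_kg": [52.0, 75.0, 70.0, 81.0]})
weather = pd.DataFrame({
    "date": days,
    "max_temp_c": [31.0, 33.0, 29.0, 30.0],
    "rain_mm": [0.0, 0.0, 12.5, 1.0],
})
holidays = pd.DataFrame({"date": days[1:], "holiday_name": ["example holiday"] * 3})

# Left joins keep every sales day even when an external table has no row for it
features = (
    sales.merge(weather, on="date", how="left")
         .merge(holidays, on="date", how="left")
)
features["is_holiday"] = features["holiday_name"].notna().astype(int)
features = features.drop(columns="holiday_name")
print(features)
```

The resulting columns (`max_temp_c`, `rain_mm`, `is_holiday`) can then be supplied as extra regressors to whichever forecasting model is used.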

Consumer purchase history:

  1. Reason: Understanding consumer buying habits supports more personalized recommendations and better inventory management.
  2. How to use: Feed into recommendation systems to increase sales.

Consumer feedback and reviews:

  1. Reason: Consumer feedback shows which items are more popular or have issues.
  2. How to use: Improve product quality and adjust inventory.

4.2 Second point

Analyze the feasibility, economics and other factors of data collection.

PS: I don’t think you need to actually collect data, you just need to describe it clearly.

Overall analysis description

Question 1: routine data analysis, requiring statistical description and visual analysis for the different categories.

Question 2: prediction + optimization, with almost no constraints. You can use suitable time series algorithms together with heuristic algorithms (differential evolution, genetic algorithms, simulated annealing, etc.).

Question 3: mainly adds more constraints on top of Question 2; it is still prediction + optimization.

Question 4: an open question, but the answer must be tied to the problem.


Origin blog.csdn.net/weixin_46211269/article/details/132893287