posted on 2023-05-21 17:28
Since late 2019, COVID-19 outbreaks of varying severity have appeared across the country. How to contain the spread of the epidemic while keeping social life and the economy running normally is a key question for epidemic prevention and control. Big data analysis provides efficient and convenient tools for precise epidemic control; it has played an important role especially in tasks such as classifying and managing people, tracing transmission routes, and assessing epidemic trends, giving health authorities a reliable basis for management decisions. The epidemic data mainly include personnel information, place information, personal self-report information, place-code scan records, nucleic acid sampling and testing information, and vaccination information.
This problem provides data from a city's COVID-19 epidemic prevention system. Based on these data, carry out a comprehensive analysis; the main tasks include data warehouse design, tracing of transmission routes, estimation of the transmission index, and assessment of epidemic trends.
Code download address: The 11th Teddy Cup Data Mining Challenge-ABC-Baseline
Fork the project to view all the code (free).
This project is for learning and reference only. Everyone is encouraged to learn through the competition; to keep the competition fair, only a basic Baseline and simple idea-sharing are provided.
If it is suspected of violating the competition rules, the project will be deleted as soon as possible.
Note: the ideas below reflect only the author's personal views and are not necessarily correct.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Detect the file encoding
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        data = f.read()
    result = chardet.detect(data)
    return result['encoding']
# Read the personnel information table
df_people = pd.read_csv('../datasets/附件2.csv', encoding=detect_encoding('../datasets/附件2.csv'))
# Read the place information table
df_place = pd.read_csv('../datasets/附件3.csv', encoding=detect_encoding('../datasets/附件3.csv'))
# Personal self-report information table
df_self_check = pd.read_csv('../datasets/附件4.csv', encoding=detect_encoding('../datasets/附件4.csv'))
# Place-code scan information table
df_scan = pd.read_csv('../datasets/附件5.csv', encoding=detect_encoding('../datasets/附件5.csv'))
# Nucleic acid sampling and testing information table
df_nucleic_acid = pd.read_csv('../datasets/附件6.csv', encoding=detect_encoding('../datasets/附件6.csv'))
# Submission example 1
result = pd.read_csv('../datasets/result1.csv', encoding=detect_encoding('../datasets/result1.csv'))
# Submission example 2
result1 = pd.read_csv('../datasets/result2.csv', encoding=detect_encoding('../datasets/result2.csv'))
# Inspect the submission examples
result.head()
result1.head()
# Descriptive statistics of the data
def summary_stats_table(data):
    '''
    Summarize all column types: per-column distributions and outlier/missing counts.
    '''
    # Count of nulls per column
    missing_counts = pd.DataFrame(data.isnull().sum())
    missing_counts.columns = ['count_null']
    # Numeric column stats
    num_stats = data.select_dtypes(include=['int64', 'float64']).describe().loc[['count', 'min', 'max', '25%', '50%', '75%']].transpose()
    num_stats['dtype'] = data.select_dtypes(include=['int64', 'float64']).dtypes.tolist()
    # Non-numeric column stats
    non_num_stats = data.select_dtypes(exclude=['int64', 'float64']).describe().transpose()
    non_num_stats['dtype'] = data.select_dtypes(exclude=['int64', 'float64']).dtypes.tolist()
    non_num_stats = non_num_stats.rename(columns={"first": "min", "last": "max"})
    # Merge the two summaries
    stats_merge = pd.concat([num_stats, non_num_stats], axis=0, join='outer',
                            ignore_index=False, sort=False).fillna("").sort_values('dtype')
    column_order = ['dtype', 'count', 'count_null', 'unique', 'min', 'max', '25%', '50%', '75%', 'top', 'freq']
    summary_stats = pd.merge(stats_merge, missing_counts, left_index=True, right_index=True, sort=False)[column_order]
    return summary_stats
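A minimal usage sketch, applying the function to each table loaded above:

# Print one summary table per attachment
for name, df in [('people', df_people), ('place', df_place),
                 ('self_check', df_self_check), ('scan', df_scan),
                 ('nucleic_acid', df_nucleic_acid)]:
    print(f'=== {name} ===')
    print(summary_stats_table(df))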
The analysis results below are based on sample data
The personnel table contains 50 records in total.
The age range is [5, 84].
Tips: The age span is large, so feature engineering based on age is natural (a small binning sketch follows these notes).
gender has three categories in total (there may be an "unknown" category), while the problem statement only mentions two.
Tips: If you later analyze or engineer features based on gender, you need to decide how to handle the third category.
nation has only one category here; the full data will most likely contain more than one.
Tips: If you later use this column for aggregation analysis or feature engineering, write dynamic code in the Baseline.
Both birthdate and create_time have 50 distinct timestamps here.
Tips: Pay attention to the earliest and latest times and compare them with related timestamps in other tables; this can surface more information or filter out abnormal records.
In the full data, timestamps will most likely contain duplicates; consider whether duplicate times affect the solution, or what duplicate times mean.
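A minimal sketch of the age-based feature engineering mentioned above (the bin edges, labels, and the column name age are assumptions for illustration, not given by the problem):

# Hypothetical age bins; edges and labels are assumptions
age_bins = [0, 18, 40, 60, 120]
age_labels = ['child', 'young_adult', 'middle_aged', 'senior']
df_people['age_group'] = pd.cut(df_people['age'], bins=age_bins, labels=age_labels)
print(df_people['age_group'].value_counts())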
Data description of the place information table
grid_point_id: place ID, used to uniquely identify a place.
name: the name of the place, such as a company, restaurant, supermarket, etc.
point_type: the type of place, such as business, entertainment, culture, medical care, etc.
x_coordinate: the X coordinate of the place, in meters, representing its location on the map.
y_coordinate: the Y coordinate of the place, in meters, representing its location on the map.
create_time: the creation time of the record, used to track when the place information was updated.
The analysis results below are based on sample data
The X and Y coordinates may be good data to visualize.
Tips: Other features can be visualized on top of the X and Y coordinates (including but not limited to name and point_type); a small sketch follows these notes.
Note, however, that this is only sample data; the full data may be large, and the visualization may not look as clean as hoped.
name is the place name; there are no duplicates in the sample data (but that does not mean the full data has none).
Tips: For repeated place names, can they be aggregated for statistics? Or handled in other ways?
point_type is the place type; there are 17 distinct types in the sample data, of which entertainment is the most frequent.
Tips: Entertainment leading is only a property of the sample, not necessarily of the full data. More data analysis and perhaps feature engineering can be built on this column.
In the full data, different samples may share the same X, Y values but differ in name, point_type, or other features. Treat such cases on their merits; do not label everything an outlier.
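A minimal visualization sketch for the tip above, assuming the column names listed in the data description:

# Scatter of places colored by point_type (column names assumed from the description)
fig, ax = plt.subplots(figsize=(8, 6))
for p_type, grp in df_place.groupby('point_type'):
    ax.scatter(grp['x_coordinate'], grp['y_coordinate'], label=p_type, s=10)
ax.set_xlabel('x_coordinate (m)')
ax.set_ylabel('y_coordinate (m)')
ax.legend(fontsize='small', ncol=2)
ax.set_title('Place distribution by point_type')
plt.show()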
Data description of the personal self-report information table
sno: serial number, used to uniquely identify a self-report record.
user_id: person ID, corresponding to user_id in the personnel information table, used to link self-reports to people.
x_coordinate: the X coordinate of the reporting location, in meters, representing the location on the map.
y_coordinate: the Y coordinate of the reporting location, in meters, representing the location on the map.
symptom: symptoms reported by the person. Possible values: 1 fever, 2 fatigue, 3 dry cough, 4 nasal congestion, 5 runny nose, 6 diarrhea, 7 dyspnea, 8 asymptomatic.
nucleic_acid_result: nucleic acid test result. Possible values: 0 negative, 1 positive, 2 unknown (optional).
resident_flag: whether the person is a permanent resident. Possible values: 0 unknown, 1 yes, 2 no.
dump_time: reporting time of the self-report record.
The analysis results below are based on sample data
symptom: the sample data do not cover all symptom categories; almost all records are asymptomatic (8).
Tips: The full data should contain every category, so when writing the Baseline consider writing the data analysis and visualization code first.
Another property of this column is that, during feature engineering, it combines well with other features into many interpretable cross features (e.g. symptom × nucleic_acid_result); a small crosstab sketch follows these notes.
nucleic_acid_result and resident_flag behave the same way.
The X and Y coordinates here differ from those in the table above; the difference between the two is worth digging into.
Tips: From the X and Y coordinates you can determine where the person reported the information.
dump_time (reporting time) can be combined with nucleic_acid_result and the X, Y coordinates to dig out the places and people around positive patients during the reporting period.
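A minimal sketch of the symptom × result cross feature mentioned above (assuming the column is named nucleic_acid_result, as in the data description):

# Cross-tabulate symptom against nucleic acid result
print(pd.crosstab(df_self_check['symptom'], df_self_check['nucleic_acid_result']))
# A simple derived cross feature, as suggested above
df_self_check['symptom_x_result'] = (df_self_check['symptom'].astype(str) + '-'
                                     + df_self_check['nucleic_acid_result'].astype(str))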
Data description of the place-code scan information table
sno: serial number, used to uniquely identify a scan record.
grid_point_id: place ID, corresponding to grid_point_id in the place information table, used to link scan records to places.
user_id: person ID, corresponding to user_id in the personnel information table, used to link scan records to people.
temperature: body temperature of the person scanning the code.
create_time: timestamp of the scan record.
The analysis results below are based on sample data
temperature: in the sample data the minimum is 36 and the maximum is 37, which is within the normal range of human body temperature.
Tips: You can cross-analyze this column with features from the personnel information table. The full data will very likely contain temperatures around 39 or higher, so it is best to account for this when writing the Baseline; a small flagging sketch follows these notes.
create_time (scan time) can be treated as the time of the person's real-time temperature reading, and then compared with features and timestamps from other tables.
Tips: For example, it can be compared with the reporting time in the self-report table and the sampling date in the nucleic acid table.
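A minimal sketch for flagging abnormal temperatures (the 37.3°C threshold is an assumption, not given by the problem):

# Flag abnormal body temperatures; the threshold is an assumption
df_scan['temperature_abnormal'] = (df_scan['temperature'] >= 37.3).astype(int)
print(df_scan['temperature_abnormal'].value_counts())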
Data description of the nucleic acid sampling and testing information table
sno: serial number, used to uniquely identify a nucleic acid sampling record.
user_id: person ID, corresponding to user_id in the personnel information table, used to link sampling records to people.
cysj: sampling date and time.
jcsj: detection date and time.
jg: test result. Possible values: negative, positive, unknown.
grid_point_id: place ID, corresponding to grid_point_id in the place information table, used to link sampling records to places.
The analysis results below are based on sample data
There are two timestamps here: the sampling time and the detection time. Logically, the detection time should be later than the sampling time.
Tips: Better safe than sorry: add a check, and treat any record whose detection time precedes its sampling time as an outlier, as in the sketch below.
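A minimal sketch of that outlier check:

# Records whose detection time precedes the sampling time are treated as outliers
cysj = pd.to_datetime(df_nucleic_acid['cysj'])
jcsj = pd.to_datetime(df_nucleic_acid['jcsj'])
df_outliers = df_nucleic_acid[jcsj < cysj]
print(f'{len(df_outliers)} records with detection earlier than sampling')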
For the jg result column, all results in the sample are negative.
Tips: We know it has three categories in total, so try to account for all of them when writing the Baseline.
Since the first two tasks do not involve Annex 7, it is not imported here.
Data visualization suggestions
The following visualizations can be done during data analysis.
Single-table visualization:
Personnel information table: demographic analysis, such as the distribution of gender, age, ethnicity, etc.; it can also be joined to other tables via the person ID.
Place information table: geographic analysis, such as place distribution, place-type distribution, place density, etc.
Personal self-report information table: epidemic monitoring, such as the symptom distribution, the correlation between symptoms and nucleic acid test results, and the location distribution of reporters.
Place-code scan information table: epidemic monitoring, such as the distribution of scan records and the correlation between scan records and nucleic acid test results.
Nucleic acid sampling and testing information table: epidemic monitoring, such as the distribution of positive people, the nucleic acid positive rate (a small sketch follows this list), and the contact places and close contacts of positive people.
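A minimal positive-rate sketch (assuming jg holds the result strings and cysj the sampling time, as in the data description):

# Daily nucleic acid positive rate over the sampling dates
df_na = df_nucleic_acid.copy()
df_na['date'] = pd.to_datetime(df_na['cysj']).dt.date
daily_rate = df_na.groupby('date')['jg'].apply(lambda s: (s == '阳性').mean())
daily_rate.plot(title='Daily positive rate')
plt.ylabel('positive rate')
plt.show()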
Cross-table analysis
Self-report table and nucleic acid table: analyze the relationship between self-reported symptoms and test results, and how symptoms and results vary across age, gender, and ethnic groups.
Place table and scan table: analyze scanning behavior at different places to learn where people scan codes most readily; also analyze people with abnormal temperatures at each place to find gaps in epidemic prevention work.
Self-report table and scan table: based on self-reported symptoms, analyze symptom occurrence across places to learn where prevention measures need strengthening.
Nucleic acid table, self-report table, and scan table together: analyze the movements of positive people, track close contacts, and support timely isolation measures.
Based on the nucleic acid test record of a positive person, the Baseline finds the places they visited within 14 days before and after the test, then finds the people who visited those places, and thereby identifies possible close contacts. The concrete steps are:
First, using the given positive person ID, filter that person's records out of the nucleic acid test table and obtain their sampling and detection times.
Next, from the place ID at sampling time, build the first list of places where the positive person was.
Then, join the positive person's ID with the place-code scan table to obtain the places they visited within 14 days before and after (the second list of places).
Merge the two place lists and deduplicate them.
Finally, join all user_ids in the scan table with the place table, and filter by the place list and the time window to identify close-contact IDs.
In short: based on the nucleic acid records, the Baseline finds the places a positive person visited within 14 days of the test, and through these places finds possible close contacts.
# Get the positive person IDs (a list; each call below handles one ID)
positive_user_ids = df_nucleic_acid[df_nucleic_acid['jg'] == '阳性']['user_id'].values.tolist()

def Potential_contacts(df_people, df_place, df_self_check, df_scan, df_nucleic_acid, positive_user_id):
    # Filter the nucleic acid test records of this positive person
    df_positive_test = df_nucleic_acid[df_nucleic_acid['user_id'] == positive_user_id].copy()
    # Convert the relevant timestamps to datetime
    df_positive_test['cysj'] = pd.to_datetime(df_positive_test['cysj'])
    df_self_check['dump_time'] = pd.to_datetime(df_self_check['dump_time'])
    df_scan['create_time'] = pd.to_datetime(df_scan['create_time'])
    # Sampling time of the positive person
    positive_test_time = df_positive_test['cysj'].iloc[0]
    # Places where the positive person was sampled (first place list)
    positive_users_place1 = pd.merge(df_positive_test, df_place, on='grid_point_id')['name'].tolist()
    # Scan records of the positive person, to find places visited within 14 days of sampling
    positive_users_place2 = pd.merge(df_positive_test, df_scan, on='user_id')[['user_id', 'create_time', 'cysj', 'grid_point_id_y']]
    # 14-day window before and after the sampling time
    delta = pd.Timedelta(days=14)
    min_date = positive_users_place2['cysj'] - delta
    max_date = positive_users_place2['cysj'] + delta
    mask = (positive_users_place2['create_time'] >= min_date) & (positive_users_place2['create_time'] <= max_date)
    positive_users_place2 = positive_users_place2.loc[mask, ['user_id', 'grid_point_id_y']]
    positive_users_place2 = positive_users_place2.rename(columns={'grid_point_id_y': 'grid_point_id'})
    positive_users_place2 = pd.merge(positive_users_place2, df_place, on='grid_point_id')['name'].tolist()
    # Merge the two place lists and deduplicate
    positive_place = list(set(positive_users_place1 + positive_users_place2))
    # People who scanned in at those places within the 14-day window
    df_potential_contacts = df_scan[(df_scan['create_time'] >= positive_test_time - delta) &
                                    (df_scan['create_time'] <= positive_test_time + delta)]
    df_potential_contacts = df_potential_contacts[df_potential_contacts['grid_point_id'].isin(
        df_place[df_place['name'].isin(positive_place)]['grid_point_id'])]
    # Assemble the output in the required submission format
    result = pd.DataFrame({
        '序号': range(1, len(df_potential_contacts) + 1),
        '密接者ID': df_potential_contacts['user_id'].values,
        '密接日期': df_potential_contacts['create_time'].dt.date.astype(str),
        '密接场所ID': df_potential_contacts['grid_point_id'].values,
        '阳性人员ID': [positive_user_id] * len(df_potential_contacts)
    })
    return result
For this task, a function named Potential_contacts is defined. Its purpose is to find the information of everyone who may have been in contact with a given positive person.
The logic of the function is as follows:
it gathers the potential close contacts together with the positive person's information and returns a data frame containing the serial number, close-contact ID, contact date, contact place ID, and positive person ID.
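One way to call it is a loop over all positive IDs collected above, stacking and renumbering the results; this overwrites the submission-example placeholder result1 with the actual Task 1 output (a sketch, not the only option):

# Run the tracker for each positive person and stack the results
frames = [Potential_contacts(df_people, df_place, df_self_check,
                             df_scan, df_nucleic_acid, uid)
          for uid in positive_user_ids]
result1 = pd.concat(frames, ignore_index=True)
result1['序号'] = range(1, len(result1) + 1)  # renumber after concatenation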
def get_sub_contacts(df_potential_contacts, df_scan):
    # Rename the place column so it can be joined with the scan table
    df_potential_contacts = df_potential_contacts.rename(columns={'密接场所ID': 'grid_point_id'})
    # All user IDs that scanned in at the close contacts' places
    contacts = pd.merge(df_potential_contacts, df_scan, on='grid_point_id')
    # Drop rows where the scanner is the close contact themselves
    contacts = contacts.drop(contacts[contacts['密接者ID'] == contacts['user_id']].index)
    # 密接日期 was stored as a date string; convert both time columns for comparison
    contacts['密接日期'] = pd.to_datetime(contacts['密接日期'])
    contacts['create_time'] = pd.to_datetime(contacts['create_time'])
    # Half-hour window on each side of the contact time
    delta = pd.Timedelta(minutes=30)
    mask = (contacts['create_time'] >= contacts['密接日期'] - delta) & (contacts['create_time'] <= contacts['密接日期'] + delta)
    contacts = contacts[mask]
    # Assemble the output (the last column carries the linking close contact's ID)
    result = pd.DataFrame({
        '序号': range(1, len(contacts) + 1),
        '次密接者ID': contacts['user_id'].values,
        '次密接日期': contacts['create_time'].values,
        '次密接场所ID': contacts['grid_point_id'].values,
        '阳性人员ID': contacts['密接者ID'].values
    })
    return result
The logic of this function is as follows:
To find secondary close contacts, we need the places where close contacts were during their contact period and the people who were there with them.
The function therefore first merges the two tables (df_potential_contacts and df_scan) to find all users who appeared at the same places as the close contacts.
Next, it filters for users who were present within half an hour before or after the close contact's contact time.
Concretely, it uses pandas' Timedelta to set a time window,
then compares the timestamps in create_time with 密接日期 to keep the contact records falling within that window.
Finally, the answer is output in the required result format.
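A usage sketch, feeding the Task 1 close-contact table into the function:

# Derive secondary close contacts from the close-contact table
result2 = get_sub_contacts(result1, df_scan)
result2.head()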
The virus transmission index can be computed from the nucleic acid sampling/testing table and the place-code scan table. A common approach is to use the basic reproduction number R0 from infectious disease epidemiology, which represents how many other people one infected person infects on average.
First, we can compute each place's average body temperature from the scan table, and use that average together with the infected person's temperature to determine an infection probability. Finally, we can use the basic reproduction number formula (R0 = infection probability × average number of contacts) to compute the transmission index.
The concrete steps are:
From the scan table, determine the mean and standard deviation of the temperature distribution at each place.
From the scan table, determine the infected persons' temperatures.
From the infected person's temperature and the place's mean temperature, compute the infection probability, e.g. P = exp(-(38-37)² / (2σ²)), where σ is the standard deviation of the temperature distribution.
From the close-contact table, determine the average number of contacts.
Compute the basic reproduction number R0 = infection probability × average number of contacts.
Compare the computed transmission index with the vaccination information table to analyze the effect of vaccination on the transmission index.
# Mean and standard deviation of temperature at each place
df_scan['temperature_mean'] = df_scan.groupby('grid_point_id')['temperature'].transform('mean')
df_scan['temperature_std'] = df_scan.groupby('grid_point_id')['temperature'].transform('std')
# Scan records of positive people
df_positive = pd.merge(df_scan, df_nucleic_acid[df_nucleic_acid['jg'] == '阳性'][['user_id']], on='user_id', how='inner')
# Mean temperature of positive people at each place
df_positive['temperature_mean_positive'] = df_positive.groupby('grid_point_id')['temperature'].transform('mean')
# Infection probability: exp(-(difference in means)² / (2σ²))
df_positive['infection_prob'] = np.exp(-((df_positive['temperature_mean_positive'] - df_positive['temperature_mean']) ** 2) / (2 * df_positive['temperature_std'] ** 2))
# Group by positive person ID and contact place ID, counting close contacts per group
grouped = result1.groupby(['阳性人员ID', '密接场所ID'])['密接者ID'].count().reset_index()
# Merge the counts back into the original table
result1 = pd.merge(result1, grouped, on=['阳性人员ID', '密接场所ID'], how='left')
result1_count = result1.rename(columns={'密接者ID_y': '密接者数量'})
# Average number of contacts per positive person and place
result1_count['平均接触人数'] = result1_count.groupby(['阳性人员ID', '密接场所ID'])['密接者数量'].transform('mean')
df_positive = pd.merge(result1_count, df_positive, left_on='阳性人员ID', right_on='user_id')
# Transmission index (label): infection probability × average number of contacts
df_positive['label'] = df_positive['infection_prob'] * df_positive['平均接触人数']
# Vaccination information table
df_vaccine_info = pd.read_csv('../datasets/附件7.csv', encoding=detect_encoding('../datasets/附件7.csv'))
df = pd.merge(df_vaccine_info, df_positive, on='user_id')
# Drop rows without a label
df = df.dropna()
col = ['age', 'gender', 'inject_times', 'vaccine_type', 'label']
df = df[col]

# Numerically encode the categorical columns (other encodings are also possible)
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

# Encode gender, inject_times, and vaccine_type so the model receives numeric input
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'].astype(str))
df['inject_times'] = le.fit_transform(df['inject_times'])
df['vaccine_type'] = le.fit_transform(df['vaccine_type'].astype(str))

# Random forest regression model
rf = RandomForestRegressor(n_estimators=100, random_state=2023)
X = df.drop('label', axis=1)
y = df['label']
rf.fit(X, y)

# Feature importances, sorted in descending order
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
names = X.columns.tolist()
sorted_names = [names[i] for i in indices]

# Plot the feature importance bar chart
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_names, rotation=90)
plt.show()
The scheme for Task 3 is only meant to spark better ideas; treat it as a reference.
Many factors are not taken into account; only age, gender, and vaccination-related features enter the model.
Many choices in this scheme can be optimized, for example the temperature distribution of each place.
Tips: If you use the per-place temperature distribution as defined in this code, you need to make assumptions, because the temperature distribution changes over time. When you do it yourself, you can take time into account and compute the temperature distribution at each moment; a small sketch follows.
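A minimal sketch of that time-aware variant (hourly bucketing is an assumption; finer or coarser windows work the same way):

# Per-place, per-hour temperature distribution; hourly buckets are an assumption
df_scan['hour'] = pd.to_datetime(df_scan['create_time']).dt.floor('H')
grp = df_scan.groupby(['grid_point_id', 'hour'])['temperature']
df_scan['temperature_mean_t'] = grp.transform('mean')
df_scan['temperature_std_t'] = grp.transform('std')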
Another point is the definition of the virus transmission index. In this scheme we take the basic reproduction number as the transmission index; a different label definition is likely to change the solution of the whole task.
In short, the scheme is somewhat one-sided and far from comprehensive. If anything is unclear, just leave a message below; this is for learning and exchange only (and not necessarily correct).
Updates to follow.
Author: kkkkkkkkkkdsdsd
Link: http://www.pythonblackhole.com/blog/article/25275/9c146d33d5d9fa06b8b3/
Source: python black hole net
Please credit the source when reprinting in any form; infringement will be pursued legally.