posted on 2023-05-21 17:28
Since late 2019, COVID-19 outbreaks of varying severity have appeared across the country. How to contain the spread of the epidemic while keeping social life and the economy running normally is a key question for epidemic prevention and control. Big data analysis provides efficient and convenient tools for precise epidemic control; it has played an important role especially in tasks such as classifying and managing people, tracing transmission routes, and assessing epidemic trends, giving health authorities a reliable basis for management decisions. The epidemic data mainly include personnel information, place information, personal self-report information, place-code scan records, nucleic acid sampling and testing information, and vaccination information.
This problem provides data from a city's COVID-19 epidemic prevention system. Based on these data, carry out a comprehensive analysis; the main tasks include data warehouse design, tracing of transmission routes, estimation of the transmission index, and assessment of epidemic trends.
Code download address: The 11th Teddy Cup Data Mining Challenge-ABC-Baseline
Fork the project to view all the code (free).
This project is for learning and reference only. Everyone is encouraged to learn through the competition; to keep the competition fair, only a basic Baseline and simple idea-sharing are provided.
If it is suspected of violating the competition rules, the project will be deleted as soon as possible.
Note: the ideas below reflect only the author's personal views and are not necessarily correct.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Detect the file encoding
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        data = f.read()
    result = chardet.detect(data)
    return result['encoding']
# Read the personnel information table
df_people = pd.read_csv('../datasets/附件2.csv', encoding=detect_encoding('../datasets/附件2.csv'))
# Read the place information table
df_place = pd.read_csv('../datasets/附件3.csv', encoding=detect_encoding('../datasets/附件3.csv'))
# Personal self-report information table
df_self_check = pd.read_csv('../datasets/附件4.csv', encoding=detect_encoding('../datasets/附件4.csv'))
# Place-code scan information table
df_scan = pd.read_csv('../datasets/附件5.csv', encoding=detect_encoding('../datasets/附件5.csv'))
# Nucleic acid sampling and testing information table
df_nucleic_acid = pd.read_csv('../datasets/附件6.csv', encoding=detect_encoding('../datasets/附件6.csv'))
# Submission example 1
result = pd.read_csv('../datasets/result1.csv', encoding=detect_encoding('../datasets/result1.csv'))
# Submission example 2
result1 = pd.read_csv('../datasets/result2.csv', encoding=detect_encoding('../datasets/result2.csv'))
# Inspect the submission examples
result.head()
result1.head()
# Descriptive statistics of the data
def summary_stats_table(data):
    '''
    Summarize all column types: per-column distributions and outlier/missing counts.
    '''
    # Count of nulls per column
    missing_counts = pd.DataFrame(data.isnull().sum())
    missing_counts.columns = ['count_null']
    # Numeric column stats
    num_stats = data.select_dtypes(include=['int64', 'float64']).describe().loc[['count', 'min', 'max', '25%', '50%', '75%']].transpose()
    num_stats['dtype'] = data.select_dtypes(include=['int64', 'float64']).dtypes.tolist()
    # Non-numeric column stats
    non_num_stats = data.select_dtypes(exclude=['int64', 'float64']).describe().transpose()
    non_num_stats['dtype'] = data.select_dtypes(exclude=['int64', 'float64']).dtypes.tolist()
    non_num_stats = non_num_stats.rename(columns={"first": "min", "last": "max"})
    # Merge the two summaries
    stats_merge = pd.concat([num_stats, non_num_stats], axis=0, join='outer',
                            ignore_index=False, sort=False).fillna("").sort_values('dtype')
    column_order = ['dtype', 'count', 'count_null', 'unique', 'min', 'max', '25%', '50%', '75%', 'top', 'freq']
    summary_stats = pd.merge(stats_merge, missing_counts, left_index=True, right_index=True, sort=False)[column_order]
    return summary_stats
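A minimal usage sketch, applying the function to each table loaded above:

# Print one summary table per attachment
for name, df in [('people', df_people), ('place', df_place),
                 ('self_check', df_self_check), ('scan', df_scan),
                 ('nucleic_acid', df_nucleic_acid)]:
    print(f'=== {name} ===')
    print(summary_stats_table(df))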
The analysis results below are based on sample data
The personnel table contains 50 records in total.
The age range is [5, 84].
Tips: The age span is large, so feature engineering based on age is natural (a small binning sketch follows these notes).
gender has three categories in total (there may be an "unknown" category), while the problem statement only mentions two.
Tips: If you later analyze or engineer features based on gender, you need to decide how to handle the third category.
nation has only one category here; the full data will most likely contain more than one.
Tips: If you later use this column for aggregation analysis or feature engineering, write dynamic code in the Baseline.
Both birthdate and create_time have 50 distinct timestamps here.
Tips: Pay attention to the earliest and latest times and compare them with related timestamps in other tables; this can surface more information or filter out abnormal records.
In the full data, timestamps will most likely contain duplicates; consider whether duplicate times affect the solution, or what duplicate times mean.
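A minimal sketch of the age-based feature engineering mentioned above (the bin edges, labels, and the column name age are assumptions for illustration, not given by the problem):

# Hypothetical age bins; edges and labels are assumptions
age_bins = [0, 18, 40, 60, 120]
age_labels = ['child', 'young_adult', 'middle_aged', 'senior']
df_people['age_group'] = pd.cut(df_people['age'], bins=age_bins, labels=age_labels)
print(df_people['age_group'].value_counts())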
Data description of the place information table
grid_point_id: place ID, used to uniquely identify a place.
name: the name of the place, such as a company, restaurant, supermarket, etc.
point_type: the type of place, such as business, entertainment, culture, medical care, etc.
x_coordinate: the X coordinate of the place, in meters, representing its location on the map.
y_coordinate: the Y coordinate of the place, in meters, representing its location on the map.
create_time: the creation time of the record, used to track when the place information was updated.
The analysis results below are based on sample data
The X and Y coordinates may be good data to visualize.
Tips: Other features can be visualized on top of the X and Y coordinates (including but not limited to name and point_type); a small sketch follows these notes.
Note, however, that this is only sample data; the full data may be large, and the visualization may not look as clean as hoped.
name is the place name; there are no duplicates in the sample data (but that does not mean the full data has none).
Tips: For repeated place names, can they be aggregated for statistics? Or handled in other ways?
point_type is the place type; there are 17 distinct types in the sample data, of which entertainment is the most frequent.
Tips: Entertainment leading is only a property of the sample, not necessarily of the full data. More data analysis and perhaps feature engineering can be built on this column.
In the full data, different samples may share the same X, Y values but differ in name, point_type, or other features. Treat such cases on their merits; do not label everything an outlier.
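A minimal visualization sketch for the tip above, assuming the column names listed in the data description:

# Scatter of places colored by point_type (column names assumed from the description)
fig, ax = plt.subplots(figsize=(8, 6))
for p_type, grp in df_place.groupby('point_type'):
    ax.scatter(grp['x_coordinate'], grp['y_coordinate'], label=p_type, s=10)
ax.set_xlabel('x_coordinate (m)')
ax.set_ylabel('y_coordinate (m)')
ax.legend(fontsize='small', ncol=2)
ax.set_title('Place distribution by point_type')
plt.show()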
Data description of the personal self-report information table
sno: serial number, used to uniquely identify a self-report record.
user_id: person ID, corresponding to user_id in the personnel information table, used to link self-reports to people.
x_coordinate: the X coordinate of the reporting location, in meters, representing the location on the map.
y_coordinate: the Y coordinate of the reporting location, in meters, representing the location on the map.
symptom: symptoms reported by the person. Possible values: 1 fever, 2 fatigue, 3 dry cough, 4 nasal congestion, 5 runny nose, 6 diarrhea, 7 dyspnea, 8 asymptomatic.
nucleic_acid_result: nucleic acid test result. Possible values: 0 negative, 1 positive, 2 unknown (optional).
resident_flag: whether the person is a permanent resident. Possible values: 0 unknown, 1 yes, 2 no.
dump_time: reporting time of the self-report record.
The analysis results below are based on sample data
symptom: the sample data do not cover all symptom categories; almost all records are asymptomatic (8).
Tips: The full data should contain every category, so when writing the Baseline consider writing the data analysis and visualization code first.
Another property of this column is that, during feature engineering, it combines well with other features into many interpretable cross features (e.g. symptom × nucleic_acid_result); a small crosstab sketch follows these notes.
nucleic_acid_result and resident_flag behave the same way.
The X and Y coordinates here differ from those in the table above; the difference between the two is worth digging into.
Tips: From the X and Y coordinates you can determine where the person reported the information.
dump_time (reporting time) can be combined with nucleic_acid_result and the X, Y coordinates to dig out the places and people around positive patients during the reporting period.
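A minimal sketch of the symptom × result cross feature mentioned above (assuming the column is named nucleic_acid_result, as in the data description):

# Cross-tabulate symptom against nucleic acid result
print(pd.crosstab(df_self_check['symptom'], df_self_check['nucleic_acid_result']))
# A simple derived cross feature, as suggested above
df_self_check['symptom_x_result'] = (df_self_check['symptom'].astype(str) + '-'
                                     + df_self_check['nucleic_acid_result'].astype(str))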
Data description of the place-code scan information table
sno: serial number, used to uniquely identify a scan record.
grid_point_id: place ID, corresponding to grid_point_id in the place information table, used to link scan records to places.
user_id: person ID, corresponding to user_id in the personnel information table, used to link scan records to people.
temperature: body temperature of the person scanning the code.
create_time: timestamp of the scan record.
The analysis results below are based on sample data
temperature: in the sample data the minimum is 36 and the maximum is 37, which is within the normal range of human body temperature.
Tips: You can cross-analyze this column with features from the personnel information table. The full data will very likely contain temperatures around 39 or higher, so it is best to account for this when writing the Baseline; a small flagging sketch follows these notes.
create_time (scan time) can be treated as the time of the person's real-time temperature reading, and then compared with features and timestamps from other tables.
Tips: For example, it can be compared with the reporting time in the self-report table and the sampling date in the nucleic acid table.
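A minimal sketch for flagging abnormal temperatures (the 37.3°C threshold is an assumption, not given by the problem):

# Flag abnormal body temperatures; the threshold is an assumption
df_scan['temperature_abnormal'] = (df_scan['temperature'] >= 37.3).astype(int)
print(df_scan['temperature_abnormal'].value_counts())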
Data description of the nucleic acid sampling and testing information table
sno: serial number, used to uniquely identify a nucleic acid sampling record.
user_id: person ID, corresponding to user_id in the personnel information table, used to link sampling records to people.
cysj: sampling date and time.
jcsj: detection date and time.
jg: test result. Possible values: negative, positive, unknown.
grid_point_id: place ID, corresponding to grid_point_id in the place information table, used to link sampling records to places.
The analysis results below are based on sample data
There are two timestamps here: the sampling time and the detection time. Logically, the detection time should be later than the sampling time.
Tips: Better safe than sorry: add a check, and treat any record whose detection time precedes its sampling time as an outlier, as in the sketch below.
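A minimal sketch of that outlier check:

# Records whose detection time precedes the sampling time are treated as outliers
cysj = pd.to_datetime(df_nucleic_acid['cysj'])
jcsj = pd.to_datetime(df_nucleic_acid['jcsj'])
df_outliers = df_nucleic_acid[jcsj < cysj]
print(f'{len(df_outliers)} records with detection earlier than sampling')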
For the jg result column, all results in the sample are negative.
Tips: We know it has three categories in total, so try to account for all of them when writing the Baseline.
Since the first two tasks do not involve Annex 7, it is not imported here.
Data visualization suggestions
The following visualizations can be done during data analysis.
Single-table visualization:
Personnel information table: demographic analysis, such as the distribution of gender, age, ethnicity, etc.; it can also be joined to other tables via the person ID.
Place information table: geographic analysis, such as place distribution, place-type distribution, place density, etc.
Personal self-report information table: epidemic monitoring, such as the symptom distribution, the correlation between symptoms and nucleic acid test results, and the location distribution of reporters.
Place-code scan information table: epidemic monitoring, such as the distribution of scan records and the correlation between scan records and nucleic acid test results.
Nucleic acid sampling and testing information table: epidemic monitoring, such as the distribution of positive people, the nucleic acid positive rate (a small sketch follows this list), and the contact places and close contacts of positive people.
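A minimal positive-rate sketch (assuming jg holds the result strings and cysj the sampling time, as in the data description):

# Daily nucleic acid positive rate over the sampling dates
df_na = df_nucleic_acid.copy()
df_na['date'] = pd.to_datetime(df_na['cysj']).dt.date
daily_rate = df_na.groupby('date')['jg'].apply(lambda s: (s == '阳性').mean())
daily_rate.plot(title='Daily positive rate')
plt.ylabel('positive rate')
plt.show()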
Cross-table analysis
Self-report table and nucleic acid table: analyze the relationship between self-reported symptoms and test results, and how symptoms and results vary across age, gender, and ethnic groups.
Place table and scan table: analyze scanning behavior at different places to learn where people scan codes most readily; also analyze people with abnormal temperatures at each place to find gaps in epidemic prevention work.
Self-report table and scan table: based on self-reported symptoms, analyze symptom occurrence across places to learn where prevention measures need strengthening.
Nucleic acid table, self-report table, and scan table together: analyze the movements of positive people, track close contacts, and support timely isolation measures.
Based on the nucleic acid test record of a positive person, the Baseline finds the places they visited within 14 days before and after the test, then finds the people who visited those places, and thereby identifies possible close contacts. The concrete steps are:
First, using the given positive person ID, filter that person's records out of the nucleic acid test table and obtain their sampling and detection times.
Next, from the place ID at sampling time, build the first list of places where the positive person was.
Then, join the positive person's ID with the place-code scan table to obtain the places they visited within 14 days before and after (the second list of places).
Merge the two place lists and deduplicate them.
Finally, join all user_ids in the scan table with the place table, and filter by the place list and the time window to identify close-contact IDs.
In short: based on the nucleic acid records, the Baseline finds the places a positive person visited within 14 days of the test, and through these places finds possible close contacts.
# Get the positive person IDs (a list; each call below handles one ID)
positive_user_ids = df_nucleic_acid[df_nucleic_acid['jg'] == '阳性']['user_id'].values.tolist()

def Potential_contacts(df_people, df_place, df_self_check, df_scan, df_nucleic_acid, positive_user_id):
    # Filter the nucleic acid test records of this positive person
    df_positive_test = df_nucleic_acid[df_nucleic_acid['user_id'] == positive_user_id].copy()
    # Convert the relevant timestamps to datetime
    df_positive_test['cysj'] = pd.to_datetime(df_positive_test['cysj'])
    df_self_check['dump_time'] = pd.to_datetime(df_self_check['dump_time'])
    df_scan['create_time'] = pd.to_datetime(df_scan['create_time'])
    # Sampling time of the positive person
    positive_test_time = df_positive_test['cysj'].iloc[0]
    # Places where the positive person was sampled (first place list)
    positive_users_place1 = pd.merge(df_positive_test, df_place, on='grid_point_id')['name'].tolist()
    # Scan records of the positive person, to find places visited within 14 days of sampling
    positive_users_place2 = pd.merge(df_positive_test, df_scan, on='user_id')[['user_id', 'create_time', 'cysj', 'grid_point_id_y']]
    # 14-day window before and after the sampling time
    delta = pd.Timedelta(days=14)
    min_date = positive_users_place2['cysj'] - delta
    max_date = positive_users_place2['cysj'] + delta
    mask = (positive_users_place2['create_time'] >= min_date) & (positive_users_place2['create_time'] <= max_date)
    positive_users_place2 = positive_users_place2.loc[mask, ['user_id', 'grid_point_id_y']]
    positive_users_place2 = positive_users_place2.rename(columns={'grid_point_id_y': 'grid_point_id'})
    positive_users_place2 = pd.merge(positive_users_place2, df_place, on='grid_point_id')['name'].tolist()
    # Merge the two place lists and deduplicate
    positive_place = list(set(positive_users_place1 + positive_users_place2))
    # People who scanned in at those places within the 14-day window
    df_potential_contacts = df_scan[(df_scan['create_time'] >= positive_test_time - delta) &
                                    (df_scan['create_time'] <= positive_test_time + delta)]
    df_potential_contacts = df_potential_contacts[df_potential_contacts['grid_point_id'].isin(
        df_place[df_place['name'].isin(positive_place)]['grid_point_id'])]
    # Assemble the output in the required submission format
    result = pd.DataFrame({
        '序号': range(1, len(df_potential_contacts) + 1),
        '密接者ID': df_potential_contacts['user_id'].values,
        '密接日期': df_potential_contacts['create_time'].dt.date.astype(str),
        '密接场所ID': df_potential_contacts['grid_point_id'].values,
        '阳性人员ID': [positive_user_id] * len(df_potential_contacts)
    })
    return result
For this task, a function named Potential_contacts is defined. Its purpose is to find the information of everyone who may have been in contact with a given positive person.
The logic of the function is as follows:
it gathers the potential close contacts together with the positive person's information and returns a data frame containing the serial number, close-contact ID, contact date, contact place ID, and positive person ID.
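One way to call it is a loop over all positive IDs collected above, stacking and renumbering the results; this overwrites the submission-example placeholder result1 with the actual Task 1 output (a sketch, not the only option):

# Run the tracker for each positive person and stack the results
frames = [Potential_contacts(df_people, df_place, df_self_check,
                             df_scan, df_nucleic_acid, uid)
          for uid in positive_user_ids]
result1 = pd.concat(frames, ignore_index=True)
result1['序号'] = range(1, len(result1) + 1)  # renumber after concatenation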
def get_sub_contacts(df_potential_contacts, df_scan):
    # Rename the place column so it can be joined with the scan table
    df_potential_contacts = df_potential_contacts.rename(columns={'密接场所ID': 'grid_point_id'})
    # All user IDs that scanned in at the close contacts' places
    contacts = pd.merge(df_potential_contacts, df_scan, on='grid_point_id')
    # Drop rows where the scanner is the close contact themselves
    contacts = contacts.drop(contacts[contacts['密接者ID'] == contacts['user_id']].index)
    # 密接日期 was stored as a date string; convert both time columns for comparison
    contacts['密接日期'] = pd.to_datetime(contacts['密接日期'])
    contacts['create_time'] = pd.to_datetime(contacts['create_time'])
    # Half-hour window on each side of the contact time
    delta = pd.Timedelta(minutes=30)
    mask = (contacts['create_time'] >= contacts['密接日期'] - delta) & (contacts['create_time'] <= contacts['密接日期'] + delta)
    contacts = contacts[mask]
    # Assemble the output (the last column carries the linking close contact's ID)
    result = pd.DataFrame({
        '序号': range(1, len(contacts) + 1),
        '次密接者ID': contacts['user_id'].values,
        '次密接日期': contacts['create_time'].values,
        '次密接场所ID': contacts['grid_point_id'].values,
        '阳性人员ID': contacts['密接者ID'].values
    })
    return result
The logic of this function is as follows:
To find secondary close contacts, we need the places where close contacts were during their contact period and the people who were there with them.
The function therefore first merges the two tables (df_potential_contacts and df_scan) to find all users who appeared at the same places as the close contacts.
Next, it filters for users who were present within half an hour before or after the close contact's contact time.
Concretely, it uses pandas' Timedelta to set a time window,
then compares the timestamps in create_time with 密接日期 to keep the contact records falling within that window.
Finally, the answer is output in the required result format.
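A usage sketch, feeding the Task 1 close-contact table into the function:

# Derive secondary close contacts from the close-contact table
result2 = get_sub_contacts(result1, df_scan)
result2.head()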
The virus transmission index can be computed from the nucleic acid sampling/testing table and the place-code scan table. A common approach is to use the basic reproduction number R0 from infectious disease epidemiology, which represents how many other people one infected person infects on average.
First, we can compute each place's average body temperature from the scan table, and use that average together with the infected person's temperature to determine an infection probability. Finally, we can use the basic reproduction number formula (R0 = infection probability × average number of contacts) to compute the transmission index.
The concrete steps are:
From the scan table, determine the mean and standard deviation of the temperature distribution at each place.
From the scan table, determine the infected persons' temperatures.
From the infected person's temperature and the place's mean temperature, compute the infection probability, e.g. P = exp(-(38-37)² / (2σ²)), where σ is the standard deviation of the temperature distribution.
From the close-contact table, determine the average number of contacts.
Compute the basic reproduction number R0 = infection probability × average number of contacts.
Compare the computed transmission index with the vaccination information table to analyze the effect of vaccination on the transmission index.
# Mean and standard deviation of temperature at each place
df_scan['temperature_mean'] = df_scan.groupby('grid_point_id')['temperature'].transform('mean')
df_scan['temperature_std'] = df_scan.groupby('grid_point_id')['temperature'].transform('std')
# Scan records of positive people
df_positive = pd.merge(df_scan, df_nucleic_acid[df_nucleic_acid['jg'] == '阳性'][['user_id']], on='user_id', how='inner')
# Mean temperature of positive people at each place
df_positive['temperature_mean_positive'] = df_positive.groupby('grid_point_id')['temperature'].transform('mean')
# Infection probability: exp(-(difference in means)² / (2σ²))
df_positive['infection_prob'] = np.exp(-((df_positive['temperature_mean_positive'] - df_positive['temperature_mean']) ** 2) / (2 * df_positive['temperature_std'] ** 2))
# Group by positive person ID and contact place ID, counting close contacts per group
grouped = result1.groupby(['阳性人员ID', '密接场所ID'])['密接者ID'].count().reset_index()
# Merge the counts back into the original table
result1 = pd.merge(result1, grouped, on=['阳性人员ID', '密接场所ID'], how='left')
result1_count = result1.rename(columns={'密接者ID_y': '密接者数量'})
# Average number of contacts per positive person and place
result1_count['平均接触人数'] = result1_count.groupby(['阳性人员ID', '密接场所ID'])['密接者数量'].transform('mean')
df_positive = pd.merge(result1_count, df_positive, left_on='阳性人员ID', right_on='user_id')
# Transmission index (label): infection probability × average number of contacts
df_positive['label'] = df_positive['infection_prob'] * df_positive['平均接触人数']
# Vaccination information table
df_vaccine_info = pd.read_csv('../datasets/附件7.csv', encoding=detect_encoding('../datasets/附件7.csv'))
df = pd.merge(df_vaccine_info, df_positive, on='user_id')
# Drop rows without a label
df = df.dropna()
col = ['age', 'gender', 'inject_times', 'vaccine_type', 'label']
df = df[col]

# Numerically encode the categorical columns (other encodings are also possible)
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

# Encode gender, inject_times, and vaccine_type so the model receives numeric input
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'].astype(str))
df['inject_times'] = le.fit_transform(df['inject_times'])
df['vaccine_type'] = le.fit_transform(df['vaccine_type'].astype(str))

# Random forest regression model
rf = RandomForestRegressor(n_estimators=100, random_state=2023)
X = df.drop('label', axis=1)
y = df['label']
rf.fit(X, y)

# Feature importances, sorted in descending order
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
names = X.columns.tolist()
sorted_names = [names[i] for i in indices]

# Plot the feature importance bar chart
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_names, rotation=90)
plt.show()
The scheme for Task 3 is only meant to spark better ideas; treat it as a reference.
Many factors are not taken into account; only age, gender, and vaccination-related features enter the model.
Many choices in this scheme can be optimized, for example the temperature distribution of each place.
Tips: If you use the per-place temperature distribution as defined in this code, you need to make assumptions, because the temperature distribution changes over time. When you do it yourself, you can take time into account and compute the temperature distribution at each moment; a small sketch follows.
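A minimal sketch of that time-aware variant (hourly bucketing is an assumption; finer or coarser windows work the same way):

# Per-place, per-hour temperature distribution; hourly buckets are an assumption
df_scan['hour'] = pd.to_datetime(df_scan['create_time']).dt.floor('H')
grp = df_scan.groupby(['grid_point_id', 'hour'])['temperature']
df_scan['temperature_mean_t'] = grp.transform('mean')
df_scan['temperature_std_t'] = grp.transform('std')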
Another point is the definition of the virus transmission index. In this scheme we take the basic reproduction number as the transmission index; a different label definition is likely to change the solution of the whole task.
In short, the scheme is somewhat one-sided and far from comprehensive. If anything is unclear, just leave a message below; this is for learning and exchange only (and not necessarily correct).
Updates to follow.
Author: kkkkkkkkkkdsdsd
Link: http://www.pythonblackhole.com/blog/article/25275/9c146d33d5d9fa06b8b3/
Source: python black hole net
Please credit the source when reprinting in any form; infringement will be pursued legally.