
Python Final Project: NetEase Cloud Music Playlist Data Analysis and Visualization (with reference to several bloggers' articles)

posted on 2023-05-21 18:21


Table of contents

Project Overview

1.1 Project source

1.2 Requirement description

Data collection

2.1 Selection of data sources

2.2 Data Acquisition

2.2.1 Design

2.2.2 Implementation

2.2.3 Effect

Data preprocessing

3.1 Design

3.2 Implementation

3.3 Effect

 Data Analysis and Visualization

4.1 Top 10 Playlists

4.1.1 Implementation

4.1.2 Results

4.1.3 Visualization

4.2 Top 10 song list favorites

4.2.1 Implementation

4.2.2 Results

4.2.3 Visualization

4.3 Top 10 song list comments

4.3.1 Implementation

4.3.2 Results

4.3.3 Visualization

4.4 Distribution of songs included in playlists

4.4.1 Implementation

4.4.2 Effect and visualization

4.4.3 Analysis

4.5 Song list label map

4.5.1 Implementation

4.5.2 Results

4.5.3 Visualization

4.5.4 Analysis

4.6 Top 10 Song List Contributions

4.6.1 Implementation

4.6.2 Results

4.6.3 Visualization

4.7 Song list name generation word cloud

4.7.1 Implementation

4.7.2 Results and visualization

4.8 Code implementation

Conclusion


Project Overview

1.1 Project source

NetEase Cloud Music is a music product developed by NetEase's Hangzhou Research Institute. Relying on professional musicians, DJs, friend recommendations and social features, the service centers on playlists, DJ programs, social networking and geographic location, with an emphasis on discovery and sharing. This project crawls the playlist section of the NetEase Cloud Music website: it collects every playlist of one music style and, for each playlist, its name, tags, introduction, favorites count, play count, number of songs and number of comments.

1.2 Requirement description

Preprocess the crawled data, then analyze and visualize the play counts, favorites counts, comment counts, number of songs per playlist, playlist tags, and the contributors (uploaders) behind the playlists, so that the results can be read more intuitively.

Data collection

2.1 Selection of data sources

Listening to music is a way for many young people to express their emotions, and NetEase Cloud Music is a popular music platform. Analyzing its playlists gives some insight into the problems and emotional pressures young people face today. It also shows what users prefer and which kinds of playlists are most popular, which is useful to music creators. For ordinary users who create playlists, a playlist makes it easier to classify and manage their own music collection, and a high-quality playlist showcases their taste, earns likes and comments, and brings a sense of accomplishment. For playlist listeners, listening by playlist greatly improves the experience. For creators such as musicians and radio hosts, playlists help spread their music and works, interact with fans and build popularity.

This project crawls the Chinese-language playlists on the NetEase Cloud Music website. The crawl address is the page titled "Chinese song list - song list - NetEase Cloud Music".

2.2 Data Acquisition

2.2.1 Design

Visit each index page and collect the playlists listed on it, then open each playlist's detail page. The playlist name, favorites count, comment count, tags, introduction, total number of songs, play count, song titles and other data are stored in the same div of the page, so each item can be picked out with a CSS selector.
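The paging part of this design can be sketched with the standard library alone. The helper name `index_page_urls` and its parameters are illustrative assumptions, not part of the original script; the base URL and query parameters are the ones the crawler in 2.2.2 uses.

```python
from urllib.parse import urlencode

BASE_URL = 'https://music.163.com/discover/playlist/'

def index_page_urls(category, total, page_size=35):
    """Yield one index-page URL per offset: 0, page_size, 2 * page_size, ..."""
    for offset in range(0, total, page_size):
        query = urlencode({
            'cat': category,       # playlist category, e.g. '华语' (Chinese)
            'order': 'hot',
            'limit': page_size,
            'offset': offset,
        })
        yield f'{BASE_URL}?{query}'

urls = list(index_page_urls('华语', total=1330))
print(len(urls))   # number of index pages to visit
print(urls[1])     # second page, offset=35
```

`urlencode` percent-encodes the non-ASCII category name explicitly. Plain string concatenation also works with `requests`, which encodes the URL itself.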

2.2.2 Implementation

    from bs4 import BeautifulSoup
    import requests
    import time

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    # Each index page lists 35 playlists; step through the offsets page by page
    for i in range(0, 1330, 35):
        print(i)
        time.sleep(2)
        url = 'https://music.163.com/discover/playlist/?cat=华语&order=hot&limit=35&offset=' + str(i)  # modify here to crawl another category
        response = requests.get(url=url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        # Tags that hold the detail-page URL of each playlist
        ids = soup.select('.dec a')
        # Tags that hold the index-page information of each playlist
        lis = soup.select('#m-pl-container li')
        print(len(lis))
        for j in range(len(lis)):
            # Detail-page address of the playlist
            url = ids[j]['href']
            # Playlist title
            title = ids[j]['title']
            # Playlist play count
            play = lis[j].select('.nb')[0].get_text()
            # Name of the playlist contributor
            user = lis[j].select('p')[1].select('a')[0].get_text()
            # Print the index-page information
            print(url, title, play, user)
            # Append the information to the CSV file
            with open('playlist.csv', 'a+', encoding='utf-8-sig') as f:
                f.write(url + ',' + title + ',' + play + ',' + user + '\n')

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests
    import time

    # on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
    df = pd.read_csv('playlist.csv', header=None, on_bad_lines='skip', names=['url', 'title', 'play', 'user'])

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    for i in df['url']:
        time.sleep(2)
        url = 'https://music.163.com' + i
        response = requests.get(url=url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        # Playlist title; ASCII commas become full-width ones so they do not break the CSV columns
        title = soup.select('h2')[0].get_text().replace(',', ',')
        # Collect the tags and join them as 'a-b-c'; '无' ("none") if the playlist has no tags
        tags = [p.get_text() for p in soup.select('.u-tag i')]
        tag = '-'.join(tags) if tags else '无'
        # Playlist introduction, or '无' ("none") if there is none
        if soup.select('#album-desc-more'):
            text = soup.select('#album-desc-more')[0].get_text().replace('\n', '').replace(',', ',')
        else:
            text = '无'
        # Favorites count, with the surrounding parentheses removed
        collection = soup.select('#content-operation i')[1].get_text().replace('(', '').replace(')', '')
        # Play count
        play = soup.select('.s-fc6')[0].get_text()
        # Number of songs in the playlist
        songs = soup.select('#playlist-track-count')[0].get_text()
        # Number of comments
        comments = soup.select('#cnt_comment_count')[0].get_text()
        # Print the detail-page information
        print(title, tag, text, collection, play, songs, comments)
        # Append the detail-page information to the CSV file
        with open('music_message.csv', 'a+', encoding='utf-8') as f:
            f.write(title + ',' + tag + ',' + text + ',' + collection + ',' + play + ',' + songs + ',' + comments + '\n')
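The scripts above build each CSV row by joining strings with commas, which is why commas inside titles and introductions have to be replaced by hand. One possible alternative, shown here as a sketch with made-up sample rows rather than real crawl output, is the standard `csv` module, which quotes such fields automatically.

```python
import csv
import io

# Hypothetical sample rows in playlist.csv order: url, title, play, user
rows = [
    ('/playlist?id=1', 'Rainy day, quiet songs', '120000', 'user_a'),
    ('/playlist?id=2', 'Morning run', '3456', 'user_b'),
]

# StringIO stands in for a real file opened with newline=''
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# Reading back: the comma inside the first title stays inside one field
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[0][1])
print(len(parsed[0]))
```

With rows written this way, the file can be read back without skipping malformed lines.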

2.2.3 Effect

The crawled content is stored in two .csv files. music_message.csv stores each playlist's name, tags, introduction, favorites count, play count, number of songs and number of comments. playlist.csv stores the address of the playlist detail page, the playlist title, its play count and the contributor's name. The results are shown in Figures 2-1 and 2-2.

 

Figure 2-1  music_message.csv file content

 

Figure 2-2  content of playlist.csv file

Data preprocessing

Part of the data cleaning was already done while capturing the data in the previous step, including dropping empty playlist records returned by the server and removing duplicates. Some cleaning remains, such as unifying the format of the comment count data.

3.1 Design

Replace "万" (ten thousand) in the favorites count with "0000" so that the column can be converted to integers for later analysis. Comment counts that were captured incorrectly (the field still holds the text "评论", meaning "comments", instead of a number) are filled with 0 and do not take part in later statistics.

3.2 Implementation

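    df['collection'] = df['collection'].astype('string').str.strip()
    # '万' means ten thousand: '12万' becomes 120000
    df['collection'] = [int(str(i).replace('万', '0000')) for i in df['collection']]
    # Drop the first three characters of the introduction
    df['text'] = [str(i)[3:] for i in df['text']]
    # A count that still contains the text '评论' ("comments") was not captured; set it to 0
    df['comments'] = [0 if '评论' in str(i).strip() else int(i) for i in df['comments']]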
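Replacing "万" with "0000" only works when the number in front of it is a whole number. As a sketch of a more defensive conversion, the hypothetical helper `parse_count` below (not part of the original code) also handles a decimal such as '1.5万'. Whether such values occur in the crawled data is an assumption.

```python
from decimal import Decimal

def parse_count(raw):
    """Convert a scraped count such as '3456', '12万' or '1.5万' to an int."""
    text = str(raw).strip()
    if text.endswith('万'):
        # Decimal avoids float rounding for values like '1.5万'
        return int(Decimal(text[:-1]) * 10000)
    try:
        return int(text)
    except ValueError:
        # Not a number, e.g. the placeholder text '评论'
        return 0

print(parse_count('12万'), parse_count('1.5万'), parse_count('评论'))
# prints: 120000 15000 0
```

Applied to a column, this would take the place of the list comprehensions, for example `df['collection'].map(parse_count)`.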

3.3 Effect

 

Figure 3-1 screenshot of program running

Data Analysis and Visualization

4.1 Top 10 Playlists

4.1.1 Implementation

    # Sort by play count, highest first, and keep the top 10
    df_play = df[['title', 'play']].sort_values('play', ascending=False)
    df_play = df_play[:10]
    print(df_play)
    _x = df_play['title'].tolist()
    _y = df_play['play'].tolist()
    # Horizontal bar chart; the title reads "NetEase Cloud Music Chinese playlists, plays TOP10"
    get_matplot(x=_x, y=_y, chart='barh', title='网易云音乐华语歌单播放 TOP10', ha='left', size=8, color=color[0])
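A Top 10 ranking like this depends on the count column being numeric. The sketch below uses made-up records and plain Python to show what can go wrong if the counts are still strings, and how converting them first gives the intended order.

```python
import heapq

# Hypothetical (title, play) pairs whose counts are still strings
records = [('A', '98000'), ('B', '1200000'), ('C', '350000')]

# Sorting the raw strings compares character by character, so '98000' ranks first
by_text = sorted(records, key=lambda r: r[1], reverse=True)

# Converting to int first gives the intended numeric ranking
top = heapq.nlargest(2, records, key=lambda r: int(r[1]))

print([title for title, _ in by_text])   # ['A', 'C', 'B']
print([title for title, _ in top])       # ['B', 'C']
```

In pandas, the equivalent safeguard is to check the column's dtype before calling sort_values.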

4.1.2 Results

 

Figure 4-1 Screenshot of program running results

4.1.3 Visualization

 

Figure 4-2  NetEase Cloud Music Chinese playlist TOP10

4.2 Top 10 song list favorites

4.2.1 Implementation

    # Sort by favorites count, highest first, and keep the top 10
    df_col = df[['title', 'collection']].sort_values('collection', ascending=False)
    df_col = df_col[:10]
    print(df_col)
    _x = df_col['title'].tolist()
    _y = df_col['collection'].tolist()
    # Horizontal bar chart; the title reads "NetEase Cloud Music Chinese playlists, favorites TOP10"
    get_matplot(x=_x, y=_y, chart='barh', title='网易云音乐华语歌单收藏 TOP10', ha='left', size=8, color=color[1])

4.2.2 Results

Figure 4-3 Screenshot of program running results

4.2.3 Visualization

 

Figure 4-4 Netease Cloud Music Chinese song list collection TOP10

4.3 Top 10 song list comments

4.3.1 Implementation

    # Sort by comment count, highest first, and keep the top 10
    df_com = df[['title', 'comments']].sort_values('comments', ascending=False)
    df_com = df_com[:10]
    print(df_com)
    _x = df_com['title'].tolist()
    _y = df_com['comments'].tolist()
    # Horizontal bar chart; the title reads "NetEase Cloud Music Chinese playlists, comments TOP10"
    get_matplot(x=_x, y=_y, chart='barh', title='网易云音乐华语歌单评论数 TOP10', ha='left', size=8, color=color[2])

4.3.2 Results

 

Figure 4-5 Screenshot of program running results

4.3.3 Visualization

 

Figure 4-6 NetEase Cloud Music Chinese song list comments TOP10

4.4 Distribution of songs included in playlists

4.4.1 Implementation

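    import numpy as np

    # Log-transform the song counts to compress the long tail before plotting.
    # Note: np.log maps a count of 0 to -inf, which a histogram cannot bin
    df_songs = np.log(df['songs'])
    # x=0 tells get_matplot there are no labels to pair with the values
    get_matplot(x=0, y=df_songs, chart='hist', title='华语歌单歌曲收录分布情况', ha='left', size=10, color=color[3])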
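Because the histogram is drawn on log-transformed counts, empty playlists need some care: the logarithm of 0 is undefined. The sketch below uses made-up song counts and the standard `math` module to show `log1p`, that is log(1 + x), as one possible workaround. NumPy offers the same function as `np.log1p`. This is a suggestion, not what the original code does.

```python
import math

# Hypothetical songs-per-playlist values, including one empty playlist
song_counts = [0, 8, 25, 40, 60, 85]

# math.log(0) raises ValueError; log1p(x) = log(1 + x) is defined at 0
# and stays close to log(x) for large x
transformed = [math.log1p(n) for n in song_counts]

print([round(v, 2) for v in transformed])
```

The transformed values keep their order, so the shape of the distribution is preserved while the empty playlists land at 0.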

4.4.2 Effect and visualization

 

Figure 4-7 Distribution of Chinese song lists

4.4.3 Analysis

The histogram shows that most playlists contain between 20 and 60 songs, with the largest holding more than 80. A few playlists are empty, but most contain more than 10 songs. Creators can use this distribution as a reference when deciding how many songs to include, which may help their playlists appeal to a wider audience.

4.5 Song list label map

4.5.1 Implementation

    def get_tag(df):
        # Split the 'a-b-c' tag strings and collect the distinct tags
        tags = df['tag'].str.split('-')
        return list(set(x for data in tags for x in data))

    df_tag = get_tag(df)

    def get_lx(x, i):
        # 1 if tag i occurs in the playlist's tag string, otherwise 0
        return 1 if i in str(x) else 0

    # One indicator column per tag; the tag list could also be written by hand,
    # here the one obtained above is reused
    for i in df_tag:
        df[i] = df['tag'].apply(get_lx, i=f'{i}')

    # Columns 7 onward are the indicator columns; their sums are the tag counts
    tag_counts = df.iloc[:, 7:].sum().sort_values(ascending=False)
    print(list(zip(tag_counts.index.tolist(), tag_counts.values.tolist()))[:10])

    # Treemap of the 20 most frequent tags; the title reads "NetEase Cloud Music Chinese playlist tag map"
    df_iex = tag_counts.index.tolist()[:20]
    df_tag = tag_counts.values.tolist()[:20]
    get_matplot(x=df_iex, y=df_tag, chart='plot', title='网易云音乐华语歌单标签图', size=10, ha='center', color=color[3])
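The indicator columns above test whether a tag occurs anywhere inside the tag string, so a tag that is a substring of another tag would be counted twice. As a sketch with invented English tag values, splitting first and counting whole tags with `collections.Counter` avoids that.

```python
from collections import Counter

# Hypothetical 'tag' values in the same 'a-b-c' format the crawler writes
tag_column = ['Chinese-pop-sad', 'Chinese-pop', 'Chinese-folk', 'Chinese-synthpop']

# Substring test: 'pop' also matches inside 'synthpop'
substring_hits = sum('pop' in value for value in tag_column)

# Split each value and count whole tags only
counts = Counter(tag for value in tag_column for tag in value.split('-'))

print(substring_hits)            # 3
print(counts['pop'])             # 2
print(counts.most_common(2))     # [('Chinese', 4), ('pop', 2)]
```

Whether the crawled tags actually overlap in this way is not known from the data shown here. If they do not, both approaches give the same counts.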

4.5.2 Results

 

Figure 4-8 Chinese song list labels

4.5.3 Visualization

 

Figure 4-9 Song list label map

4.5.4 Analysis

The tag map shows the styles of the playlists. From it one can gauge the mood of today's mainstream songs, what listeners are looking for, and the music preferences of NetEase Cloud Music users. The content is fairly diverse, covering Chinese pop, Western pop, electronic and other styles.

4.6 Top 10 Song List Contributions

4.6.1 Implementation

    # playlist.csv was written as UTF-8 with a BOM and without a header row
    df_user = pd.read_csv('playlist.csv', encoding='utf-8-sig', header=None, names=['url', 'title', 'play', 'user'], sep=',', on_bad_lines='skip')
    # Drop the url column
    df_user = df_user.iloc[:, 1:]
    # Count the playlists of each contributor
    df_user['count'] = 0
    df_user = df_user.groupby('user', as_index=False)['count'].count()
    df_user = df_user.sort_values('count', ascending=False)[:10]
    print(df_user)
    names = df_user['user'].tolist()
    nums = df_user['count'].tolist()
    # Horizontal bar chart; the title reads "Playlist contributors TOP10"
    get_matplot(x=names, y=nums, chart='barh', title='歌单贡献UP主 TOP10', ha='left', size=10, color=color[4])

4.6.2 Results

 

Figure 4-10 Song list contribution up top ten

4.6.3 Visualization

 

Figure 4-11 Top 10 contributors to the song list

4.7 Song list name generation word cloud

4.7.1 Implementation

    import wordcloud
    import pandas as pd

    data = pd.read_excel('music_message.xlsx')
    # Sort by play count and keep only the top 50
    data = data.sort_values('play', ascending=False).head(50)

    # font_path selects the font; Microsoft YaHei (msyh.ttc) is used here
    w1 = wordcloud.WordCloud(width=1000, height=700,
                             background_color='black',
                             font_path='msyh.ttc')
    txt = "\n".join(i for i in data['title'])
    w1.generate(txt)
    w1.to_file('F:\\词云.png')

4.7.2 Results and visualization

 

Figure 4-12 Word cloud generated from playlist names

4.8 Code implementation

To simplify the code, a generic plotting function is built:

get_matplot(x,y,chart,title,ha,size,color)

x is the x-axis data;

y is the y-axis data;

chart is the chart type, one of three: barh, hist, and plot (a treemap drawn with squarify.plot);

title is the chart title;

ha is the horizontal alignment of the text labels;

size is the font size;

color is the color of the chart;

    import matplotlib.pyplot as plt
    import pandas as pd
    import squarify

    def get_matplot(x, y, chart, title, ha, size, color):
        # Font family and size
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['font.size'] = size
        plt.rcParams['axes.unicode_minus'] = False
        # Figure and axes
        fig = plt.figure(figsize=(16, 8), dpi=80)
        ax = plt.subplot(111)
        ax.patch.set_color('white')
        lines = plt.gca()
        # Pair labels with values; x=0 means there are no labels (histogram)
        if x != 0:
            x.reverse()
            y.reverse()
            data = pd.Series(y, index=x)
        # Axis colors
        lines.spines['right'].set_color('none')
        lines.spines['top'].set_color('none')
        lines.spines['left'].set_color((64/255, 64/255, 64/255))
        lines.spines['bottom'].set_color((64/255, 64/255, 64/255))
        # Hide the axis ticks
        lines.xaxis.set_ticks_position('none')
        lines.yaxis.set_ticks_position('none')
        if chart == 'barh':
            # Horizontal bar chart in the given color
            data.plot.barh(ax=ax, width=0.7, alpha=0.7, color=color)
            # Title and its font size
            ax.set_title(f'{title}', fontsize=18, fontweight='light')
            # Write each value next to its bar
            for pos, val in enumerate(data.values):
                plt.text(val + 0.3, pos - 0.12, '%s' % val, ha=f'{ha}')
        elif chart == 'hist':
            # Histogram with a fixed color
            ax.hist(y, bins=30, alpha=0.7, color=(21/255, 47/255, 71/255))
            # Title and its font size
            ax.set_title(f'{title}', fontsize=18, fontweight='light')
        elif chart == 'plot':
            # Seven colors cycled to 47 entries, the same sequence as a written-out list
            base = ['#adb0ff', '#ffb3ff', '#90d595', '#e48381', '#aafbff', '#f7bb5f', '#eafb50']
            colors = (base * 7)[:47]
            plot = squarify.plot(sizes=y, label=x, color=colors, alpha=1, value=y, edgecolor='white', linewidth=1.5)
            # Label font size
            plt.rc('font', size=6)
            # Title size
            plot.set_title(f'{title}', fontsize=13, fontweight='light')
            # Remove the axes
            plt.axis('off')
            # Remove the ticks on the top and right borders
            plt.tick_params(top=False, right=False)
        # Show the figure
        plt.show()

    # Color sequence used by the charts
    color = [(153/255, 0/255, 102/255), (8/255, 88/255, 121/255), (160/255, 102/255, 50/255),
             (136/255, 43/255, 48/255), (16/255, 152/255, 168/255), (153/255, 0/255, 102/255)]

Conclusion

While completing this final project I learned many new things and tied together what I had learned in class this semester. When my memory of a topic was hazy, I went back through the textbook and the recorded lectures, which resolved the problem and deepened my understanding, so I can handle the same issue next time. For problems the course had not covered, I searched for information online and practiced the solutions I found until I could actually solve the problem. Everyone runs into problems and unfamiliar areas in practice, and technology keeps changing, so what I have learned also needs to be kept up to date. Finding accurate information online and arriving at a solution quickly is therefore an essential skill.

I also ran into many problems during this assignment, such as errors during data crawling, failed visualizations and code I could not understand. When a problem came up, I first checked my own code and fixed any errors I found. If I still could not solve it, I searched for the program's error message to find a solution. Fortunately, every problem I encountered during this assignment was resolved.

Completing this assignment was bittersweet. Learning brought both the anxiety of facing problems I could not solve and the sense of accomplishment of solving them. I gained a good deal of knowledge and some practical skills along the way. Thanks to my teachers and classmates for their help. I will be more diligent in the future, keep improving my abilities, and work harder at learning Python and data analysis.




Author:Believesinkinto

link:http://www.pythonblackhole.com/blog/article/25322/d0fa5c6ca801a9ef275a/

source:python black hole net

Please indicate the source for any form of reprinting. If any infringement is discovered, it will be held legally responsible.
