posted on 2024-11-07 20:04
I have a pandas DataFrame that defines my bag-of-words indices and counts, like this:
id     word_count  word_idx
15213  1           1192
15213  1           1215
15213  1           1674
15213  1           80
15213  1           179
307    2           103
307    1           80
307    3           1976
I need a fast way to turn this into a matrix of bag-of-words arrays, one row per id. Let's say my vocabulary length is 2000:
VOCAB_LEN = 2000
My current solution is too slow, but here it is.
Function:
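import numpy as np

def to_bow_array(word_idx_list, word_count_list):
    # Start from an all-zero vector of vocabulary length
    zeros = np.zeros(VOCAB_LEN, dtype=np.uint8)
    # Write each count at the position given by its word index
    zeros[np.array(word_idx_list)] = np.array(word_count_list)
    return zeros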
Groupby and apply function:
df.groupby('id').apply(lambda row: to_bow_array(list(row['word_idx']),
                                                list(row['word_count'])))
This returns my expected output. For every id, I get something like:
array([0, 0, 1, ..., 0, 2, 0], dtype=uint8)
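For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame from the table above and runs the same groupby/apply. The final `np.stack` step and the names `bows` and `matrix` are additions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

VOCAB_LEN = 2000

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "id":         [15213, 15213, 15213, 15213, 15213, 307, 307, 307],
    "word_count": [1, 1, 1, 1, 1, 2, 1, 3],
    "word_idx":   [1192, 1215, 1674, 80, 179, 103, 80, 1976],
})

def to_bow_array(word_idx_list, word_count_list):
    zeros = np.zeros(VOCAB_LEN, dtype=np.uint8)
    zeros[np.array(word_idx_list)] = np.array(word_count_list)
    return zeros

# One uint8 array per id; groupby sorts the keys, so id 307 comes first.
bows = df.groupby("id").apply(
    lambda g: to_bow_array(list(g["word_idx"]), list(g["word_count"]))
)

# Stack the per-id arrays into a single (n_ids, VOCAB_LEN) matrix.
matrix = np.stack(bows.to_list())
```

Recent pandas versions may emit a deprecation warning about `apply` operating on the grouping column; it does not affect the result here, since only `word_idx` and `word_count` are read.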
I need a faster implementation. I know that apply should be avoided when speed matters. How can I achieve this? Thanks.
I think you need:
s = df.set_index(['id', 'word_idx'])['word_count'].unstack(fill_value=0).reindex(columns=np.arange(2000), fill_value=0)
Then convert each row to a tuple (or list):
s.apply(tuple, axis=1)
Out[342]:
id
307 (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
15213 (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
dtype: object
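If the goal is the matrix itself rather than a Series of tuples, another option is to skip the reshape and fill a preallocated NumPy array directly. The following is only a sketch under assumptions: it is a different technique from the unstack approach above, it assumes each (id, word_idx) pair occurs at most once (as in the sample), and the names `rows`, `ids`, `bow`, and `wide` are made up for illustration.

```python
import numpy as np
import pandas as pd

VOCAB_LEN = 2000

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "id":         [15213, 15213, 15213, 15213, 15213, 307, 307, 307],
    "word_count": [1, 1, 1, 1, 1, 2, 1, 3],
    "word_idx":   [1192, 1215, 1674, 80, 179, 103, 80, 1976],
})

# Map every id to a row number; `ids` holds the id that each row stands for.
rows, ids = pd.factorize(df["id"], sort=True)

# Allocate the full matrix once, then fill all cells in a single assignment.
bow = np.zeros((len(ids), VOCAB_LEN), dtype=np.uint8)
bow[rows, df["word_idx"].to_numpy()] = df["word_count"].to_numpy()

# Cross-check against the unstack/reindex result from the answer.
wide = (df.set_index(["id", "word_idx"])["word_count"]
          .unstack(fill_value=0)
          .reindex(columns=np.arange(VOCAB_LEN), fill_value=0))
assert (wide.to_numpy() == bow).all()
```

Row `i` of `bow` belongs to `ids[i]`. Note that `s.to_numpy()` on the unstacked frame gives the same matrix, and that `uint8` (kept from the question) only holds counts up to 255.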
Author:qs
link:http://www.pythonblackhole.com/blog/article/246859/7ebb5ced475bd66baa11/
source:python black hole net