
Faster Python implementation from bag of words data frame to array

posted on 2024-11-07 20:04


I have a pandas data frame that defines my bag of word indices and counts like this.

id      word_count  word_idx
15213   1           1192
15213   1           1215
15213   1           1674
15213   1           80
15213   1           179
307     2           103
307     1           80
307     3           1976

I need a fast way to turn this into a bag-of-words matrix, with one count vector per id. Let's say my vocabulary length is 2000: VOCAB_LEN = 2000

My current solution is too slow, but here it is:

Function

import numpy as np

def to_bow_array(word_idx_list, word_count_list):
    # Dense count vector for one id: position word_idx holds word_count
    zeros = np.zeros(VOCAB_LEN, dtype=np.uint8)
    zeros[np.array(word_idx_list)] = np.array(word_count_list)
    return zeros

Groupby and apply function

df.groupby('id').apply(lambda group: to_bow_array(list(group['word_idx']),
                                                 list(group['word_count'])))

This returns my expected output: for every id, something like array([0, 0, 1, ..., 0, 2, 0], dtype=uint8)

I need a faster implementation. I know that apply should be avoided for fast implementations. How can I achieve this? Thanks


solution


I think you need to pivot the frame so that each id becomes a row and each word index becomes a column:

s = df.set_index(['id', 'word_idx'])['word_count'].unstack(fill_value=0).reindex(columns=np.arange(2000), fill_value=0)

Then we convert each row to a tuple or a list:

s.apply(tuple, axis=1)
Out[342]: 
id
307      (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
15213    (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
dtype: object
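
If a plain NumPy matrix is enough, the pivot can also be skipped entirely. The following is only a sketch, not part of the original answer: it assumes the column names from the question, rebuilds the sample frame for illustration, and uses pd.factorize to turn each id into a row number before filling a preallocated array with one fancy-indexed assignment. Like the original to_bow_array, it assigns counts rather than summing them, so a repeated (id, word_idx) pair would be overwritten.

```python
import numpy as np
import pandas as pd

VOCAB_LEN = 2000  # vocabulary size stated in the question

# Sample data copied from the question, rebuilt here for illustration.
df = pd.DataFrame({
    "id":         [15213, 15213, 15213, 15213, 15213, 307, 307, 307],
    "word_count": [1, 1, 1, 1, 1, 2, 1, 3],
    "word_idx":   [1192, 1215, 1674, 80, 179, 103, 80, 1976],
})

# Map every id to a row number; sort=True orders the rows by ascending id.
row_codes, unique_ids = pd.factorize(df["id"], sort=True)

# One row per id, one column per vocabulary entry.
bow = np.zeros((len(unique_ids), VOCAB_LEN), dtype=np.uint8)

# Single vectorised assignment: bow[row, word_idx] = word_count.
bow[row_codes, df["word_idx"].to_numpy()] = df["word_count"].to_numpy()
```

Row i of bow belongs to unique_ids[i], so the vector for a given id can be looked up without converting anything to tuples. Note that uint8 only holds counts up to 255; a wider dtype may be needed for larger counts.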



Author:qs

link:http://www.pythonblackhole.com/blog/article/246859/7ebb5ced475bd66baa11/

source:python black hole net

Please indicate the source for any form of reprinting. If any infringement is discovered, it will be held legally responsible.
