Topic Modelling in an existing dataframe in python

I am trying to perform topic extraction on a pandas DataFrame. I am using LDA topic modeling to extract the topics, and fitting the model on the dataframe as a whole works without problems.

However, I would like to apply the LDA topic model to each row of my dataframe, so that every row gets its own topic and topic weight.

Current dataframe:

date        cust_id   words
3/14/2019   100001    samantha slip skirt pi ski
1/21/2020   10002     steel skirt solid greenish
5/19/2020   10003     arizona denim blouse d

The dataframe I am looking for:

date        cust_id   words                        topic 0 words   topic 0 weights
3/14/2019   100001    samantha slip skirt pi ski   skirt           0.5
1/21/2020   10002     skirt solid greenish         greenish        0.2
5/19/2020   10003     arizona denim blouse         denim           01

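from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Raw string so the regex backslashes are passed through unchanged
vectorizer = CountVectorizer(max_df=0.9, min_df=20, token_pattern=r'\w+|\$[\d.]+|\S+')

tf = vectorizer.fit_transform(features['words']).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=1111)

model.fit(tf)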


I tried to merge the two dataframes together, but it does not work.

How can I add the topic words and the topic weights as new columns, with one value for every row of my dataframe?

I also posted the question on Stack Overflow: https://stackoverflow.com/questions/71476309/topic-modelling-in-an-existing-dataframe-in-python

Topic dataframe pandas lda topic-model python

Category Data Science


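You can try this. Note that it uses the gensim API (ldamodel[corpus] and show_topic), so it expects a trained gensim LDA model and a bag-of-words corpus rather than the scikit-learn model from the question: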

import pandas as pd

def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one record per document; DataFrame.append was removed in pandas 2.0
    rows = []

    # ldamodel[corpus] yields, for each document, a list of (topic_id, probability) pairs
    for doc_topics in ldamodel[corpus]:
        # Get the dominant topic, its contribution and its keywords for each document
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append((int(topic_num), round(float(prop_topic), 4), topic_keywords))

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output.
    # Reset the index so the texts line up with the rows by position.
    contents = pd.Series(texts).reset_index(drop=True)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
    

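# texts must be one-dimensional (a list or a single column), e.g. the 'words' column from the question
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=df['words'])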


df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']


df_dominant_topic.head(5)

You can find the detailed implementation in this Kaggle Notebook
