python-3.x

Automatic topic labelling for topic modelling

shivanshu dhawan

June 2, 2022 13:06

I am curious whether there is a way to automatically generate labels for the topics in topic modelling. A Python implementation would be really helpful.

Topic: python-3.x topic-model nlp machine-learning

Category: Data Science

RAM crashed for XML to DataFrame conversion function

Ishan Dutta

May 31, 2022 20:08

I have created the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB, but for anything larger the RAM (13 GB on Google Colab) runs out and the session crashes. The same happens if I try it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code? Code #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …

Topic: dataframe python-3.x parsing pandas python

Category: Data Science
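The function in the question above is cut off, so the following is only a minimal sketch of one common way to keep memory bounded: streaming the file with `iterparse` instead of loading the whole tree. The helper name `iter_records`, the tag name `row`, and the sample document are assumptions made for illustration; the real file layout is not shown in the question.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element while streaming the file."""
    # "end" events fire once an element and all of its children have been read.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # Flatten the direct children into {tag: text}.
            yield {child.tag: child.text for child in elem}
            # Release the children of the processed element.
            elem.clear()


# Hypothetical sample document standing in for the large XML file.
sample = io.BytesIO(
    b"<rows><row><a>1</a><b>x</b></row><row><a>2</a><b>y</b></row></rows>"
)
records = list(iter_records(sample, "row"))
```

For a large file, the yielded dicts would be collected in batches (for example a few thousand at a time) and each batch passed to `pd.DataFrame`, rather than materialising everything with `list(...)` as the small demo does.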

Expected performance of training tf.keras.Sequential model with model.fit, model.fit_generator and model.train_on_batch

Tuukka Nieminen

May 26, 2022 10:01

I am using Keras with the TensorFlow backend to train a simple 1D CNN to detect specific events from sensor data. While the data, with tens of millions of samples, easily fits in RAM as a 1D float array, it obviously takes a huge amount of memory to store the data as an N x inputDim array that can be passed to model.fit for training. While I can use model.fit_generator or model.train_on_batch to generate the required mini batches …

Topic: python-3.x keras tensorflow

Category: Data Science

SKLearn - Different Results Between Default Linear Model and 1st Order Polynomial Linear Model

Austin Prater

May 25, 2022 18:29

SUMMARY I'm building a linear regression model using scikit-learn and noticing that the model "performance" (RMSE and max error, namely) varies depending on whether I use the default LR or apply PolynomialFeatures(degree=1). My understanding is that these outcomes should be identical, since both use a first-order LR model; however, my error is consistently lower when using the PolynomialFeatures version. TLDR When I run the code below, the second chunk (polynomial = degree of 1) is consistently …

Topic: machine-learning-model python-3.x linear-regression scikit-learn

Category: Data Science

Why does the YOLOv4 PyTorch re-training loss start as high as in first-time training?

Rajesh das

May 25, 2022 00:05

I set up a YOLOv4 PyTorch framework in Google Colab by running git clone https://github.com/roboflow-ai/pytorch-YOLOv4.git. I generated checkpoints by training. Since we need a more robust model, I trained again starting from the pretrained checkpoint, but the loss started at a high value, just as in the first training run. The training command is !python train.py -b 2 -s 1 -l 0.001 -g 0 -pretrained ./Yolov4_epoch100_latest.pth -classes 1 -dir ./train -epochs 100. Not sure if my pretrained checkpoint is used …

Topic: object-detection yolo pytorch python-3.x deep-learning

Category: Data Science

How to Approach Linear Machine-Learning Model When Input Variables are Inconsistent

Austin Prater

May 24, 2022 23:40

Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but am hitting a roadblock when it comes to my input data. This dataset consists of a few key input criteria: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …

Topic: python-3.x linear-regression scikit-learn dimensionality-reduction

Category: Data Science

How to solve "ValueError: y should be a 1d array, got an array of shape () instead."?

Asma Tolihan

May 24, 2022 08:47

from tkinter import * from tkinter import ttk from tkmacosx import Button top = Tk() top.title("Jobs") top.geometry("1000x800") line1 = LabelFrame(top, text='') line1.pack(expand = 'yes', fill = 'both') n = StringVar() categorychoosen = ttk.Combobox(line1, width = 27, textvariable = n) # Adding combobox drop down list categorychoosen['values'] = ('Advocate','Arts','Automation Testing','Blockchain','Business Analyst', 'Web Designing') categorychoosen.place(x=50, y=150) categorychoosen.current() name=Label(line3,text="Welcom to ... company",font =("Arial", 10)) name.place(x=0, y=0) n1 = StringVar() sectionchoosen = ttk.Combobox(line3, width = 27, textvariable = n1) # Adding combobox drop down …

Topic: k-nn overfitting python-3.x classification python

Category: Data Science

Applying Differencing on a time series, before or after train and test split?

frantic oreo

May 19, 2022 11:44

I am attempting to improve my RNN model by making my dependent variable, a stock price, stationary. I am aiming to make the series stationary by removing the trend with a log transformation and then performing moving average differencing to remove noise. I have a function that first takes the log of the series, to penalise the larger values, and then performs rolling mean differencing on the values. def moving_avg_differencing(col, n_roll=30, drop=False): log_values = np.log(col) moving_avg = log_values.rolling(n_roll).mean() ma_diff = log_values - moving_avg …

Topic: python-3.x time-series python

Category: Data Science

I need to plot only the training curve in the fastai library using the learner.recorder.plot_losses() function. fastai devs, please help

Harshit Joshi

May 18, 2022 19:36

I have a task where I need to plot only the training loss, not the validation loss, using the plot_losses function of the recorder attached to the learner object in the fastai library, but I have not been able to implement this properly. I am using fastai v1 due to project restrictions. Here is the GitHub code: class Recorder(LearnerCallback): "A `LearnerCallback` that records epoch, loss, opt and metric data during training." def plot_losses(self, skip_start:int=0, …

Topic: fastai python-3.x computer-vision deep-learning machine-learning

Category: Data Science

Visualization with many lines, colors, and markers

Robyc

May 13, 2022 22:02

I have a bunch of plots like the one shown below. The data is from measurements performed at different times and on different days. In the plot (which is a cumulative distribution function, if that matters), the colors differentiate data from different days; the markers are used to further differentiate the data within each day. The problem is that the plot is very crowded and a bit ugly. Some markers can barely be seen. Question: Any idea how I can …

Topic: matplotlib python-3.x visualization

Category: Data Science

Predict continuous variable based on categorical columns mostly

Chris

May 8, 2022 01:05

I have a large dataset (40 million rows, 50 columns) with mostly categorical columns (some of them are numerical) and I am using Python/Pandas. Categorical columns have up to 3000 unique labels. I am looking for best practices on how to approach this. Obviously, one-hot encoding (OHE) as-is is out of the question. I have tried reducing the number of categories and applying OHE that way, but the model was very bad, a lot of information is …

Topic: python-3.x pandas

Category: Data Science

Categorical data preprocessing for training an algorithm

spd

May 6, 2022 04:42

I have a training dataset where the values of the "Output" column depend on three columns (which are categorical, with no ordering).
Inp1    Inp2      Inp3                Output
A,B,C   AI,UI,JI  Apple,Bat,Dog       Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon    Noun
X,Y,Z   LI,AI,UI  Xmas,Yemen,Zombie   Extras
So, based on this training data, I need an ML algorithm to predict any incoming data row such that the output of the most similar training row is assigned. The rows can go on increasing (hence get_dummies is creating a lot …

Topic: python-3.x prediction preprocessing categorical-data machine-learning

Category: Data Science

Create new rows based on a value in a column

conradoov

May 2, 2022 18:01

My dataset is generated like this example: df = {'event':['A','B','C','D'], 'budget':['123','433','1000','1299'], 'duration_days':['6','3','4','2']} I need to create rows for each event based on the column 'duration_days': if duration = 6, the event should have 6 rows:
event  budget  duration_days
A      123     6
A      123     6
A      123     6
A      123     6
A      123     6
A      123     6
B      123     3
B      123     3
B      123     3

Topic: data-science-model dataframe python-3.x pandas dataset

Category: Data Science
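One way to expand rows like this is to repeat each index label as many times as its 'duration_days' value and then select by the repeated index. The sketch below assumes the dict from the question is first wrapped in a `pd.DataFrame`; the variable names `repeats` and `expanded` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "event": ["A", "B", "C", "D"],
    "budget": ["123", "433", "1000", "1299"],
    "duration_days": ["6", "3", "4", "2"],
})

# duration_days is stored as strings in the question, so convert before repeating.
repeats = df["duration_days"].astype(int)

# Index.repeat duplicates each index label `repeats` times;
# .loc then selects the corresponding rows, once per duplicate.
expanded = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
```

Each event keeps its own budget in the repeated rows, so event B carries '433' here.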

How to combine and separate test and train data for data cleaning?

Ishan Dutta

May 2, 2022 13:28

I am working on an ML model for which I have been provided the data in 2 files, test.csv and train.csv. I want to perform data cleaning on both files together by concatenating them and then separating them. I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code. CODE test = pd.read_csv('test.csv') train = pd.read_csv('train.csv') df = pd.concat([test, train]) //Data Cleaning steps //Separating them back to …

Topic: dataframe python-3.x pandas dataset python

Category: Data Science
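A possible approach is to tag each frame with a key at concatenation time and select by that key afterwards. In this sketch the two small frames stand in for the CSV files, and the `fillna` step is only a placeholder for the cleaning steps, which the question does not show.

```python
import pandas as pd

# Small stand-ins for pd.read_csv('train.csv') and pd.read_csv('test.csv').
train = pd.DataFrame({"x": [1.0, None, 3.0], "y": [0, 1, 0]})
test = pd.DataFrame({"x": [None, 5.0]})

# keys= adds an outer index level recording which file each row came from.
df = pd.concat([train, test], keys=["train", "test"])

# Placeholder cleaning step (an assumption): fill missing x with a constant.
df["x"] = df["x"].fillna(0.0)

# Select each source back out by its key.
train_clean = df.xs("train")
test_clean = df.xs("test").drop(columns=["y"])  # test has no target column
```

Note that cleaning steps which compute statistics (means, medians, scalers) on the combined frame let information from the test set influence the training data, so those are usually fitted on the training rows only.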

Python: SARIMAX Model Fits too slowly

Subhawna

May 1, 2022 14:06

I have time series data with the date and temperature records of a city. Following are my observations from the time series analysis: Plotting date vs temperature shows seasonality. The adfuller test shows that the data is already stationary, so d=0. Partial autocorrelation and autocorrelation with the first seasonal difference gave p=2 and q=10 respectively. Code to Train Model model=sm.tsa.statespace.SARIMAX(df['temperature'],order=(1, 1, 1),seasonal_order=(2,0,10,12)) results=model.fit() This fit function runs indefinitely and does not reach …

Topic: colab python-3.x arima time-series python

Category: Data Science

Encode each comma separated value in Pandas

spd

May 1, 2022 04:14

I have a dataset
Inp1    Inp2      Inp3               Output
A,B,C   AI,UI,JI  Apple,Bat,Dog      Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon   Noun
X,Y     AI,UI     Yemen,Zombie       Extras
I need to apply an ML algorithm to these values, and hence need an encoding technique. I tried label encoding, but it encodes the entire cell to an int, e.g.
Inp1  Inp2  Inp3  Output
5     4     8     0
But I need a separate encoding for each value in a cell. How should I go about it? Inp1 Inp2 …

Topic: categorical-encoding one-hot-encoding python-3.x pandas categorical-data

Category: Data Science
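The target layout in the question above is cut off, so the following is only a sketch of one option: a multi-hot encoding in which every distinct value inside a cell gets its own 0/1 column, built with `Series.str.get_dummies`. Only two of the input columns are reproduced, and the prefixing scheme is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
})

encoded_parts = []
for col in ["Inp1", "Inp2"]:
    # Strip spaces around commas so "Lawn, Moon" and "Lawn,Moon" encode the same way.
    cleaned = df[col].str.replace(r"\s*,\s*", ",", regex=True)
    # One 0/1 column per distinct value found in the cells of this column.
    dummies = cleaned.str.get_dummies(sep=",").add_prefix(f"{col}_")
    encoded_parts.append(dummies)

encoded = pd.concat(encoded_parts, axis=1)
```

A value that appears twice in one cell (such as LI in the second row of Inp2) is still encoded as a single 1, so this representation records presence, not counts.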

Clustering Tweet Data using DBSCAN Algorithm

Nilani Algiriyage

April 29, 2022 20:22

I am clustering tweets using the DBSCAN algorithm. I apply all the preprocessing steps and convert sentences to vector format before applying the algorithm. However, it always puts a lot of tweets into the '0' class. The following is the plot showing eps with number of clusters. The following are the parameters that I pass. dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters. label -1 1221 0 1349 1 2 2 2 3 4 ... …

Topic: python-3.x text dbscan scikit-learn clustering

Category: Data Science

Tensorflow 2 Semantic Segmentation - loss function for two classes

S_S

April 26, 2022 14:24

I am trying to implement a U-net for semantic segmentation with two classes (foreground=1 and background=0) in the segmentation mask images, following this tutorial. They have used SparseCategoricalCrossentropy for OUTPUT_CLASSES = 3, as shown below: OUTPUT_CLASSES = 3 model = unet_model(output_channels=OUTPUT_CLASSES) model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy']) EPOCHS = 20 VAL_SUBSPLITS = 5 VALIDATION_STEPS = info.splits['test'].num_examples//BATCH_SIZE//VAL_SUBSPLITS model_history = model.fit(train_batches, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH, validation_steps=VALIDATION_STEPS, validation_data=test_batches, callbacks=[DisplayCallback()]) If I use the same network with OUTPUT_CLASSES = 2 the training loss is NaN. What should I use …

Topic: semantic-segmentation python-3.x keras tensorflow deep-learning

Category: Data Science

Multi-level timeseries forecasting? How to do it?

Azra Tuni

April 25, 2022 13:09

So, I just finished a 48 hr datathon, and I did terribly, to be honest. It was my first datathon. We were given a list of datasets: 5 months of taxi demand data (January to May); weather dataset; zone neighbors; dt (date and time of prediction). And we were told to build a time series forecasting model to forecast the taxi demand. We were told to do it in a forecasting manner, like, Train with January and Test with February, …

Topic: forecasting python-3.x time-series python

Category: Data Science

Get row wise frequency count of words from list in text column pandas

shivanshu dhawan

April 24, 2022 21:02

I have a data frame with an Audio Transcript column from customer care phone conversations. I have created a list of words and sentences: words = ["rain", "buy new house", "tornado"] What I need to do is create a column in the data frame that checks for these words in the text column row by row and, if any are present, records each word and its frequency. For example first row text "I was going to buy new house …

Topic: python-3.x word-embeddings text-mining nlp

Category: Data Science
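The per-row counting described above could be sketched with a small helper built on the standard `re` module. The function name `count_phrases` and the example sentence are made up for illustration; the `words` list is the one from the question.

```python
import re

words = ["rain", "buy new house", "tornado"]


def count_phrases(text, phrases):
    """Return {phrase: count} for the phrases that occur in text (case-insensitive)."""
    counts = {}
    for phrase in phrases:
        # \b anchors keep "rain" from matching inside "train" or "raining".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[phrase] = n
    return counts


# Made-up example text, not taken from the question's data.
example = count_phrases(
    "Rain again. I want to buy new house before the rain and the train.", words
)
```

With pandas, the helper could be applied to each row with something like `df["Audio Transcript"].apply(count_phrases, phrases=words)`, assuming the column is named as in the question.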
