I am just curious to know if there is a way to automatically get the lables for the topics in Topic modelling. It would be really helpful if there's any python implementation of it.
I have created the following function which converts an XML File to a DataFrame. This function works good for files smaller than 1 GB, for anything greater than that the RAM(13GB Google Colab RAM) crashes. Same happens if I try it locally on Jupyter Notebook (4GB Laptop RAM). Is there a way to optimize the code? Code #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …
I am using Keras with Tensorflow backend to train a simple 1D CNN to detect specific events from sensor data. While the data with tens of millions samples easily fits to the ram in the form of an 1D float array, it obviously takes a huge amount of memory to store the data as a N x inputDim array that can be passed to model.fit for training. While I can use model.fit_generator or model.train_on_batch to generate the required mini batches …
SUMMARY I'm building a linear regression model using Scikit and noticing that the model "performance" (RMSE and max error, namely) varies depending on whether I use the default LR or whether I apply PolynomialFeature(degree=1). My understanding is that these outcomes should be identical, since they are both utilizing a single-order LR model, however, my error is consistently lower when using the PolyFeatures version. TLDR When I run the code below, the second chunk (polynomial = degree of 1) is consistently …
I had a setup a yolo4 pytorch framework in google colab by cloning git clone https://github.com/roboflow-ai/pytorch-YOLOv4.git. I generated checkpoints by giving training. As we need more robust training model, I given training again with assigning pretrained checkpoints but it seems loss started with high value as like first time training. Code is for training !python train.py -b 2 -s 1 -l 0.001 -g 0 -pretrained ./Yolov4_epoch100_latest.pth -classes 1 -dir ./train -epochs 100. Not sure if my pretrained checkpoint is used …
Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but am hitting a roadblock when it comes to my input data. This dataset consists of a few key input criteria: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
I am attempting to improve my RNN model by making my dependent variable, a stock price, non-stationary. I am aiming to make the series stationary by removing the trend with a log transformation and then performing moving average differencing to remove noise. I have a function that initially logs the series, to penalise the larger values and then performing rolling mean differencing on the values. def moving_avg_differencing(col, n_roll=30, drop=False): log_values = np.log(col) moving_avg = log_values.rolling(n_roll).mean() ma_diff = log_values - moving_avg …
I have a task where I need to only plot the training loss and not the validation loss of the plot_losses function in the fastai library with learner object having recorder class, but I am not able to properly implement the same. I am using the fastai v1 for this purpose due to project restrictions. Here is the github code for the same: class Recorder(LearnerCallback): "A `LearnerCallback` that records epoch, loss, opt and metric data during training." def plot_losses(self, skip_start:int=0, …
I have a bunch of plots as the one reported below. The data is from measurements performed on different times and different days. In the plot (which is a cumulative distribution function, if that matters), the colors differentiate data relevant to different days; the markers are used to further differentiate the data within each day. The problem is that the plot is very crowded and a bit ugly. Some markers can be barely seen. Question: Any idea how I can …
I have a large dataset (40 mil rows, 50 columns) with mostly categorical columns (some of them are numerical) and I am using Python/Pandas. Categorical columns have up to 3000 unique labels. I am looking for best practices on how to approach this. Obviously one-hot encoding (OHE) as it is is out of question. I have tried to make smaller number of categories and do OHE in that way but the model was very bad, a lot of information is …
I have a training dataset where values of "Output" col is dependent on three columns (which are categorical [No ordering]). Inp1 Inp2 Inp3 Output A,B,C AI,UI,JI Apple,Bat,Dog Animals L,M,N LI,DO,LI Lawn, Moon, Noon Noun X,Y,Z LI,AI,UI Xmas,Yemen,Zombie Extras So, based on this training data, I need a ML Algorithm to predict any incoming data row such that if it is Similar to training rows highest similar output aassigned. The rows can go on increasing (hence get_dummies is creating a lot …
My dateset is generated like the example df = {'event':['A','B','C','D'], 'budget':['123','433','1000','1299'], 'duration_days':['6','3','4','2']} I need to create rows for each event based on the column 'duration_days', if I have duration = 6 the event may have 6 rows: event budget duration_days A 123 6 A 123 6 A 123 6 A 123 6 A 123 6 A 123 6 B 123 3 B 123 3 B 123 3
I am working on an ML model in which I have been provided the data in 2 files test.csv and train.csv. I want to perform data cleaning on both files together be concatenating them and then separating them. I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code. CODE test = pd.read_csv('test.csv') train = pd.read_csv('train.csv') df = pd.concat([test, train]) //Data Cleaning steps //Separating them back to …
I have a time series data with the date and temperature records of a city. Following are my observations from the time series analysis: By plotting the graph of date vs temperature seasonality is observed. Performing adfuller test we find that the data is already stationary, so d=0. Perform Partial Autocorrelation and Autocorrelation with First Seasonal Difference and found p=2 and q=10 respectively. Code to Train Model model=sm.tsa.statespace.SARIMAX(df['temperature'],order=(1, 1, 1),seasonal_order=(2,0,10,12)) results=model.fit() This fit function runs indefinitely and does not reach …
I have a dataset Inp1 Inp2 Inp3 Output A,B,C AI,UI,JI Apple,Bat,Dog Animals L,M,N LI,DO,LI Lawn, Moon, Noon Noun X,Y AI,UI Yemen,Zombie Extras For these values, I need to apply a ML algorithm. Hence need an encoding technique. Tried a Label encoding technique, it encodes the entire cell to an int for eg. Inp1 Inp2 Inp3 Output 5 4 8 0 But I need a separate encoding for each value in a cell. How should I go about it. Inp1 Inp2 …
I am doing a tweet clustering using DBSCAN algorithm. I use all the preprocessing steps and convert sentences to vector format before applying the algorithm. However, It always puts a lot of tweets in to the '0' class. The following is the plot showing eps with number of clusters. The following are the parameters that I pass. dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters. label -1 1221 0 1349 1 2 2 2 3 4 ... …
I am trying to implement a U-net for semantic segmentation with two classes (foreground=1 and background=0) in the segmentation mask images, following this tutorial. They have used SparseCategoricalCrossentropy for OUTPUT_CLASSES = 3, as shown below: OUTPUT_CLASSES = 3 model = unet_model(output_channels=OUTPUT_CLASSES) model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy']) EPOCHS = 20 VAL_SUBSPLITS = 5 VALIDATION_STEPS = info.splits['test'].num_examples//BATCH_SIZE//VAL_SUBSPLITS model_history = model.fit(train_batches, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH, validation_steps=VALIDATION_STEPS, validation_data=test_batches, callbacks=[DisplayCallback()]) If I use the same network with OUTPUT_CLASSES = 2 the training loss is NaN. What should I use …
So, I just finished a 48 hr datathon, and I did terribly, to be honest. It was my first datathon. We were given a list of datasets: 5 months of taxi demand data (January to May) Weather dataset Zone neighbors dt (date and time of prediction) And we were told to build a time series forecasting model to forecast the taxi demand. We were told to do it in a forecasting manner, like, Train with January and Test with February, …
I have a data frame with a Audio Transcript column from customer care phone conversation. I have created one list with words and sentences words = ["rain", "buy new house", "tornado"] What I need to do is create a column in the data frame which checks these words in the text column row by row and if it presents then update the column with word and it's frequency. For example first row text "I was going to buy new house …