RAM crashes in XML-to-DataFrame conversion function

I have written the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB, but for anything larger the RAM (13 GB on Google Colab) crashes. The same happens if I run it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code? Code: #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …
Category: Data Science
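One common way to keep memory bounded for large XML files is to stream the file instead of loading the whole tree. The sketch below is not the asker's `xml2df` (its body is cut off above); it assumes, hypothetically, that the file is a flat sequence of record elements with a known tag (`row` in the sample) whose children hold the field values. `iter_records` and the sample document are illustrative names.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # An "end" event fires once an element and its children are complete.
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            # Detach processed records from the root so they can be freed.
            root.clear()


# Hypothetical sample document; a real file path can be passed instead.
sample = io.BytesIO(
    b"<rows>"
    b"<row><id>1</id><name>a</name></row>"
    b"<row><id>2</id><name>b</name></row>"
    b"</rows>"
)
rows = list(iter_records(sample, "row"))
print(rows)
```

The yielded dicts could then be collected in fixed-size batches and turned into DataFrames one batch at a time, rather than accumulating every record in a single list.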

Expected performance of training tf.keras.Sequential model with model.fit, model.fit_generator and model.train_on_batch

I am using Keras with the TensorFlow backend to train a simple 1D CNN to detect specific events from sensor data. While the data, with tens of millions of samples, easily fits in RAM as a 1D float array, it obviously takes a huge amount of memory to store the data as an N x inputDim array that can be passed to model.fit for training. While I can use model.fit_generator or model.train_on_batch to generate the required mini batches …
Category: Data Science

SKLearn - Different Results Between Default Linear Model and 1st Order Polynomial Linear Model

SUMMARY: I'm building a linear regression model using scikit-learn and noticing that the model "performance" (namely RMSE and max error) varies depending on whether I use the default LR or apply PolynomialFeatures(degree=1). My understanding is that these outcomes should be identical, since both use a first-order LR model. However, my error is consistently lower when using the PolynomialFeatures version. TLDR: When I run the code below, the second chunk (polynomial = degree of 1) is consistently …
Category: Data Science

Why does YOLOv4 PyTorch re-training loss start as high as in first-time training?

I set up a YOLOv4 PyTorch framework in Google Colab by running git clone https://github.com/roboflow-ai/pytorch-YOLOv4.git. I generated checkpoints by training. Because we need a more robust model, I trained again starting from the pretrained checkpoint, but the loss started at a high value, just as in the first training run. The training command is !python train.py -b 2 -s 1 -l 0.001 -g 0 -pretrained ./Yolov4_epoch100_latest.pth -classes 1 -dir ./train -epochs 100. I am not sure if my pretrained checkpoint is used …
Category: Data Science

How to Approach Linear Machine-Learning Model When Input Variables are Inconsistent

Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but am hitting a roadblock when it comes to my input data. This dataset consists of a few key input criteria: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
Category: Data Science

How to solve "ValueError: y should be a 1d array, got an array of shape () instead."?

from tkinter import * from tkinter import ttk from tkmacosx import Button top = Tk() top.title("Jobs") top.geometry("1000x800") line1 = LabelFrame(top, text='') line1.pack(expand = 'yes', fill = 'both') n = StringVar() categorychoosen = ttk.Combobox(line1, width = 27, textvariable = n) # Adding combobox drop down list categorychoosen['values'] = ('Advocate','Arts','Automation Testing','Blockchain','Business Analyst', 'Web Designing') categorychoosen.place(x=50, y=150) categorychoosen.current() name=Label(line3,text="Welcom to ... company",font =("Arial", 10)) name.place(x=0, y=0) n1 = StringVar() sectionchoosen = ttk.Combobox(line3, width = 27, textvariable = n1) # Adding combobox drop down …
Category: Data Science

Applying Differencing on a time series, before or after train and test split?

I am attempting to improve my RNN model by making my dependent variable, a stock price, stationary. I am aiming to make the series stationary by removing the trend with a log transformation and then performing moving average differencing to remove noise. I have a function that first takes the log of the series, to penalise the larger values, and then performs rolling mean differencing on the values. def moving_avg_differencing(col, n_roll=30, drop=False): log_values = np.log(col) moving_avg = log_values.rolling(n_roll).mean() ma_diff = log_values - moving_avg …
Category: Data Science

Plotting only the training curve in the fastai library using the learner.recorder.plot_losses() function

I have a task where I need to plot only the training loss, not the validation loss, with the plot_losses function of the Recorder class attached to the learner object in the fastai library, but I have not been able to implement this properly. I am using fastai v1 because of project restrictions. Here is the code from GitHub: class Recorder(LearnerCallback): "A `LearnerCallback` that records epoch, loss, opt and metric data during training." def plot_losses(self, skip_start:int=0, …
Category: Data Science

Visualization with many lines, colors, and markers

I have a bunch of plots like the one shown below. The data is from measurements performed at different times and on different days. In the plot (which is a cumulative distribution function, if that matters), the colors differentiate data from different days; the markers further differentiate the data within each day. The problem is that the plot is very crowded and a bit ugly. Some markers can barely be seen. Question: Any idea how I can …
Category: Data Science

Predict a continuous variable based mostly on categorical columns

I have a large dataset (40 million rows, 50 columns) with mostly categorical columns (some are numerical), and I am using Python/Pandas. Categorical columns have up to 3000 unique labels. I am looking for best practices on how to approach this. Obviously, one-hot encoding (OHE) as is is out of the question. I have tried reducing the number of categories and applying OHE that way, but the model was very bad; a lot of information is …
Category: Data Science

Categorical data preprocessing for training an algorithm

I have a training dataset where the value of the "Output" column depends on three columns, which are categorical (no ordering).
Inp1    Inp2      Inp3               Output
A,B,C   AI,UI,JI  Apple,Bat,Dog      Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon   Noun
X,Y,Z   LI,AI,UI  Xmas,Yemen,Zombie  Extras
So, based on this training data, I need an ML algorithm to predict the output for any incoming data row, such that the row is assigned the output of the most similar training rows. The rows can keep increasing (hence get_dummies is creating a lot …
Category: Data Science

Create new rows based on a value in a column

My dataset is generated like this example: df = {'event':['A','B','C','D'], 'budget':['123','433','1000','1299'], 'duration_days':['6','3','4','2']} I need to create rows for each event based on the column 'duration_days': if duration_days = 6, the event should have 6 rows:
event  budget  duration_days
A      123     6
A      123     6
A      123     6
A      123     6
A      123     6
A      123     6
B      433     3
B      433     3
B      433     3
Category: Data Science
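The row expansion described above can be sketched with `Index.repeat`, which duplicates each index label a given number of times. The example wraps the dict from the question in a `DataFrame`; the variable names `repeats` and `expanded` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "event": ["A", "B", "C", "D"],
    "budget": ["123", "433", "1000", "1299"],
    "duration_days": ["6", "3", "4", "2"],
})

# The counts are stored as strings in the example, so convert them first.
repeats = df["duration_days"].astype(int)

# Index.repeat duplicates each index label; .loc then selects those rows.
expanded = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
print(expanded)
```

Each event keeps its own budget in every copy, and `reset_index(drop=True)` replaces the repeated index labels with a fresh 0..n-1 index.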

How to combine and separate test and train data for data cleaning?

I am working on an ML model for which I have been given the data in 2 files, test.csv and train.csv. I want to perform data cleaning on both files together by concatenating them and then separating them. I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code. CODE test = pd.read_csv('test.csv') train = pd.read_csv('train.csv') df = pd.concat([test, train]) # Data Cleaning steps # Separating them back to …
Category: Data Science
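One possible way to make the combined frame separable again is to label each block at concatenation time with the `keys` argument of `pd.concat`. In this sketch the two small frames are hypothetical stand-ins for the CSV files, and the `fillna` call is only a placeholder for the real cleaning steps.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_csv('train.csv') and pd.read_csv('test.csv').
train = pd.DataFrame({"x": [1.0, None, 3.0], "y": [0, 1, 0]})
test = pd.DataFrame({"x": [None, 5.0]})

# keys= adds an outer index level that records where each row came from.
df = pd.concat([train, test], keys=["train", "test"])

# Placeholder cleaning step: fill missing x values with a constant.
df["x"] = df["x"].fillna(0.0)

# Select each block by its key; the original row index comes back.
train_clean = df.xs("train")
test_clean = df.xs("test").drop(columns="y")  # y never existed in test
```

A caveat on the design: any statistic computed on the combined frame during cleaning (a mean used for imputation, for example) also sees the test rows, which may or may not be acceptable for the task.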

Python: SARIMAX Model Fits Too Slowly

I have time series data with the date and temperature records of a city. These are my observations from the time series analysis: Plotting date vs. temperature shows seasonality. The adfuller test indicates that the data is already stationary, so d=0. Partial autocorrelation and autocorrelation with the first seasonal difference gave p=2 and q=10 respectively. Code to Train Model model=sm.tsa.statespace.SARIMAX(df['temperature'],order=(1, 1, 1),seasonal_order=(2,0,10,12)) results=model.fit() This fit function runs indefinitely and does not reach …
Category: Data Science

Encode each comma separated value in Pandas

I have a dataset:
Inp1    Inp2      Inp3              Output
A,B,C   AI,UI,JI  Apple,Bat,Dog     Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon  Noun
X,Y     AI,UI     Yemen,Zombie      Extras
I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell as a single int, e.g.:
Inp1  Inp2  Inp3  Output
5     4     8     0
But I need a separate encoding for each value in a cell. How should I go about it? Inp1 Inp2 …
Category: Data Science
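The desired output is cut off above, so the following is only one interpretation: a multi-hot encoding, in which every distinct comma-separated token gets its own 0/1 column, built with `Series.str.get_dummies`. This is a swapped-in technique rather than label encoding, and the helper `encode_tokens` is a hypothetical name. Only two of the input columns are shown to keep the sketch short.

```python
import pandas as pd

df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp3": ["Apple,Bat,Dog", "Lawn, Moon, Noon", "Yemen,Zombie"],
})


def encode_tokens(series, prefix):
    """Return one 0/1 column per distinct comma-separated token."""
    # Normalise ", " to "," so that " Moon" and "Moon" count as one token.
    cleaned = series.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
    return cleaned.str.get_dummies(sep=",").add_prefix(prefix + "_")


encoded = pd.concat(
    [encode_tokens(df[col], col) for col in df.columns], axis=1
)
print(encoded.columns.tolist())
```

Prefixing each column with its source column name keeps identical tokens from different inputs (such as `AI` in more than one column) from colliding.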

Clustering Tweet Data using DBSCAN Algorithm

I am clustering tweets using the DBSCAN algorithm. I apply all the preprocessing steps and convert sentences to vector format before applying the algorithm. However, it always puts a lot of tweets into the '0' class. The following is the plot showing eps against the number of clusters. The following are the parameters that I pass. dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters (label: count). -1: 1221, 0: 1349, 1: 2, 2: 2, 3: 4, ... …
Category: Data Science

Tensorflow 2 Semantic Segmentation - loss function for two classes

I am trying to implement a U-net for semantic segmentation with two classes (foreground=1 and background=0) in the segmentation mask images, following this tutorial. They have used SparseCategoricalCrossentropy for OUTPUT_CLASSES = 3, as shown below: OUTPUT_CLASSES = 3 model = unet_model(output_channels=OUTPUT_CLASSES) model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy']) EPOCHS = 20 VAL_SUBSPLITS = 5 VALIDATION_STEPS = info.splits['test'].num_examples//BATCH_SIZE//VAL_SUBSPLITS model_history = model.fit(train_batches, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH, validation_steps=VALIDATION_STEPS, validation_data=test_batches, callbacks=[DisplayCallback()]) If I use the same network with OUTPUT_CLASSES = 2 the training loss is NaN. What should I use …
Category: Data Science

Multi-level timeseries forecasting? How to do it?

So, I just finished a 48 hr datathon, and I did terribly, to be honest. It was my first datathon. We were given a list of datasets: 5 months of taxi demand data (January to May); Weather dataset; Zone neighbors; dt (date and time of prediction). And we were told to build a time series forecasting model to forecast the taxi demand. We were told to do it in a forecasting manner, like, Train with January and Test with February, …
Category: Data Science

Get row wise frequency count of words from list in text column pandas

I have a data frame with an Audio Transcript column from customer care phone conversations. I have created a list of words and phrases: words = ["rain", "buy new house", "tornado"] What I need to do is create a column in the data frame that checks for these words in the text column row by row and, if any are present, records each word and its frequency. For example first row text "I was going to buy new house …
Category: Data Science
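The per-row counting described above can be sketched as a plain function that takes one transcript and returns the phrases it contains with their counts. Whole-word, case-insensitive matching is an assumption here (the question does not say how matches should be defined), and `phrase_counts` and the sample text are hypothetical.

```python
import re

words = ["rain", "buy new house", "tornado"]


def phrase_counts(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase in text."""
    counts = {}
    for phrase in phrases:
        # \b keeps "rain" from matching inside words such as "training".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[phrase] = n
    return counts


# Hypothetical transcript text.
text = "Rain again. I was going to buy new house before the rain started."
print(phrase_counts(text, words))
```

With pandas, the same function could be applied to a text column row by row, for instance `df["text"].apply(phrase_counts, phrases=words)`, where the column name `text` is a placeholder.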

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.