Should the weight distribution change more when fine-tuning a transformer-based classifier?

I'm using a pre-trained DistilBERT model from Hugging Face with a custom classification head, which is almost the same as in the reference implementation:

    from torch import nn
    from transformers import DistilBertModel


    class PretrainedTransformer(nn.Module):
        def __init__(self, target_classes):
            super().__init__()
            base_model_output_shape = 768
            self.base_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
            # Two-layer classification head on top of the base model output
            self.classifier = nn.Sequential(
                nn.Linear(base_model_output_shape, out_features=base_model_output_shape),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(base_model_output_shape, out_features=target_classes),
            )
            # Initialise only the linear layers of the head
            for layer in self.classifier:
                if isinstance(layer, nn.Linear):
                    layer.weight.data.normal_(mean=0.0, std=0.02)
                    if layer.bias is not None:
                        layer.bias.data.zero_()

        def forward(self, input_, y=None):
            X, length, attention_mask = input_
            base_output = self.base_model(X, attention_mask=attention_mask)[0]
            # Hidden state of the first token of each sequence
            base_model_last_layer = base_output[:, 0]
            cls = self.classifier(base_model_last_layer)
            return cls

During …
Category: Data Science

Generating the right target for an LSTM model

I'll try to explain my question using a simplified data set. Given the following dataset:

       day   f1    f2
    0    0   10  1000
    1    1   45  2000
    2    2  120  3400
    3    3   90  5000

I'm trying two approaches to generate a score based on the data observations: Approach 1: I've scaled the features so the max value is 1.0 by dividing each feature by its max value to get:

       day        f1    f2
    0    0  0.083333  0.20
    1    1  0.375000  0.40
    2 …
Category: Data Science
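
The max-scaling step described in Approach 1 can be sketched as follows. This is only a minimal illustration using the four rows shown in the question; the helper name `scale_by_max` is hypothetical and not part of the original post.

```python
import numpy as np

# Values taken from the simplified dataset in the question.
f1 = np.array([10, 45, 120, 90], dtype=float)
f2 = np.array([1000, 2000, 3400, 5000], dtype=float)


def scale_by_max(values):
    """Divide a feature by its maximum so that the largest value becomes 1.0."""
    return values / values.max()


f1_scaled = scale_by_max(f1)
f2_scaled = scale_by_max(f2)

print(np.round(f1_scaled, 6))
print(np.round(f2_scaled, 6))
```

The first two scaled rows agree with the values quoted in the question (0.083333 and 0.375 for f1, 0.20 and 0.40 for f2).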

Why do histogram bars vanish when the bins value is set high in matplotlib?

Also, the histogram bar widths differ for certain values of bins. How can I keep the bar widths uniform? I have tried using rwidth, but that does not solve my problem.

Data:

    test                                           age
    17 - Alpha OH PROGESTERONE - HORMONE ASSAYS    23
    17 - Alpha OH PROGESTERONE - HORMONE ASSAYS    26
    17 ALPHA HYDROXY PROGESTERONE                  18
    17 ALPHA HYDROXY PROGESTERONE                  21
    17 ALPHA HYDROXY PROGESTERONE                  25
    17 ALPHA HYDROXY PROGESTERONE                  27

Code:

    axes = plt.gca()
    axes.set_xlim(0, 100)
    axes.set_ylim(0, …
Category: Data Science
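
One possible way to get uniform bins for integer-valued data such as ages is to pass explicit, equally spaced bin edges instead of a bare bin count. The sketch below uses only the six ages shown in the question and `np.histogram`; the choice of one-year-wide bins centred on each integer is an assumption made for illustration.

```python
import numpy as np

# The six ages visible in the question's data sample.
ages = np.array([23, 26, 18, 21, 25, 27])

# One-year-wide bins centred on each integer age (an illustrative choice).
edges = np.arange(ages.min() - 0.5, ages.max() + 1.5, 1.0)
counts, edges = np.histogram(ages, bins=edges)

# Every bin has the same width because the edges are equally spaced.
widths = np.diff(edges)
print(counts)
print(widths)
```

The same `edges` array can be passed to `plt.hist(ages, bins=edges)`. When the number of bins is much larger than the number of distinct values, most bins are necessarily empty, which may be part of what the question observes.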

CDF plot overlay on a histogram in Python

I have a dataframe column, df['ProgressStep'], and I would like to overlay a CDF plot on its histogram. I have tried two methods, but neither meets my target perfectly. Please help me fine-tune the code; either method is fine for me. How can I do the following things: (1) add/edit the plot title and Y axis title; (2) add/edit the primary X axis title, for example, I want more granularity here; (3) for overlapped plots, add a secondary X axis against the histogram; (4) show …
Category: Data Science
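
As a rough sketch of the underlying computation, the histogram counts and the empirical CDF on the same bin edges can be obtained with NumPy alone. The data below is a hypothetical stand-in for `df['ProgressStep']`, and the bin count of 10 is an arbitrary assumption.

```python
import numpy as np

# Hypothetical stand-in for df['ProgressStep'].
rng = np.random.default_rng(0)
progress_step = rng.integers(1, 11, size=200)

# Histogram counts and the matching bin edges.
counts, edges = np.histogram(progress_step, bins=10)

# Empirical CDF evaluated at the right edge of each bin.
cdf = np.cumsum(counts) / counts.sum()
print(counts)
print(cdf)
```

For plotting, one option is to draw the counts with `ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")` and the CDF on a twin axis created with `ax.twinx()`, which shares the x-axis and adds a second y-axis. Titles and axis labels can then be set with `ax.set_title`, `ax.set_xlabel` and `ax.set_ylabel`.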

Multi-modal histogram and real-world measurements

I have a histogram of real-world measurements of the wind speed at a given site. There are many zeros in the dataset, presumably because the wind was far too gentle to trigger the sensor into reading anything at all. My question is: how should I fit functions to this data, and could anyone point me to a good resource on this subject?
Category: Data Science

Can the same data set (dynamic) be described as Chaotic & Pareto?

I'm trying to abstract the mathematical part of the problem as much as possible before the details follow. There is a dynamic data set of size $O(2^{32})$; a recent result described it as a power-law distribution, with the average approaching $1-2$ and a peak at $100$, as stated. I was motivated by the fact that a subset is known to sometimes contain values of $O(10^5)$, and the first lesson in statistics is that the average is not enough to represent …
Category: Data Science

How to evaluate KDE against histogram?

I am currently testing some approaches for density estimation, and I think the basic approach of histograms may not be the best option for me, while KDE is certainly a good alternative. A while ago I found a very interesting tutorial by Jake VanderPlas which explains KDE in a nice way. In his tutorial, Jake optimized KDE bandwidth selection using a grid search maximizing the log-likelihood of the density estimation given some samples, but that is built into sklearn and …
Category: Data Science
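
The idea of choosing a bandwidth by maximizing a log-likelihood over a grid can be sketched in plain NumPy. Note that this is not the sklearn-based procedure from the tutorial mentioned in the question: it is a hand-written Gaussian KDE scored by leave-one-out log-likelihood, and the sample data, the grid and the function name `loo_log_likelihood` are all assumptions made for illustration.

```python
import numpy as np


def loo_log_likelihood(samples, bandwidth):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE."""
    n = samples.size
    diffs = (samples[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Exclude each point from its own density estimate.
    np.fill_diagonal(kernel, 0.0)
    density = kernel.sum(axis=1) / (n - 1)
    # A zero density gives -inf, which simply loses the grid search.
    with np.errstate(divide="ignore"):
        return np.log(density).sum()


rng = np.random.default_rng(0)
samples = rng.normal(size=200)           # hypothetical data
bandwidths = np.logspace(-2, 1, 30)      # hypothetical search grid

scores = np.array([loo_log_likelihood(samples, h) for h in bandwidths])
best_bandwidth = bandwidths[np.argmax(scores)]
print(best_bandwidth)
```

Scoring each point against a density built from the other points matters here: without leaving the point out, the likelihood keeps growing as the bandwidth shrinks towards zero.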

Getting different visualization results in Jupyter and the DataCamp code shell. How can I solve this?

The left image is from a Jupyter notebook and the right one is from the DataCamp exercises. Can anyone please let me know why I am getting different results in Jupyter? Used hacker statistics to calculate the chances of winning a bet. Used random number generators, loops, and Matplotlib to gain a competitive edge!

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(123)

    # Simulate random walk 500 times
    all_walks = []
    for i in range(500):
        random_walk = [0]
        for …
Category: Data Science

Plotting different values in pandas histogram with different colors

I am working on a dataset. The dataset consists of 16 different features, each having values belonging to the set (0, 1, 2). In order to check the distribution of values in each column, I used the pandas.DataFrame.hist() method, which gave me a plot as shown below: I want to represent the distribution for each value in a column with a different color. For example, in column 1, all the values corresponding to '0' should be in red while the …
Category: Data Science

Histogram plot with plt.hist()

I am a Python newbie and want to plot a list of values between -0.2 and 0.2. The list looks like this [...-0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01501152092971969, -0.01489985147131656, -0.015833709930856088, -0.015833709930856088, -0.015833709930856088, -0.015833709930856088, -0.015833709930856088...and so on]. In statistics I've learned to group my data into classes to get a useful histogram for such a large data set. How can I add classes in Python to my plot? My code is plt.hist(data) and the histogram looks like …
Category: Data Science
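
Grouping values into classes corresponds to choosing bin edges. A minimal sketch, assuming 20 equal-width classes over the range -0.2 to 0.2 mentioned in the question (the number of classes and the generated data are hypothetical):

```python
import numpy as np

# Hypothetical values in the range mentioned in the question (-0.2 to 0.2).
rng = np.random.default_rng(0)
data = np.clip(rng.normal(0.0, 0.05, size=1000), -0.2, 0.2)

# 20 classes of width 0.02 each; the number of classes is an arbitrary choice.
class_edges = np.linspace(-0.2, 0.2, 21)
counts, _ = np.histogram(data, bins=class_edges)
print(counts)
```

The same edges can be passed to matplotlib as `plt.hist(data, bins=class_edges)`. Passing an integer instead, for example `bins=20`, asks for that many equal-width classes over the range of the data.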

Triplet optimization producing a weird diagonal line?

I'm pretty sure this is the right forum for this; if not, let me know and I'll happily move this to a better place. I have a strange problem. I've written an algorithm designed to take three files of UNIX timestamps, and produce a list of triplets in order of closeness. Each triplet is unique (no two triplets share an element), each triplet has one element from each file, and each triplet {x,y,z} is created so as to minimize max(x,y,z) - …
Category: Data Science

How to better represent three sets of categorical data?

Given three sets of data with a categorical integer x-axis with the same range (0-10):

    from itertools import chain
    from collections import Counter, defaultdict

    from IPython.display import Image
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import colorlover as cl
    import matplotlib.pyplot as plt

    data1 = Counter({8: 10576, 9: 10114, 7: 9504, 6: 7331, 10: 6845, 5: 5007, 4: 3037, 3: 1792, 2: 908, 1: 368, 0: 158})
    data2 = Counter({5: 9030, 6: 8347, 4: 8149, 7: …
Category: Data Science

What is the difference between HLC (Histogram of local features), CSS (color self-similarity) and MDST (Max DisSimilarity of Different Templates)?

I'm new to computer vision and have been researching detection algorithms and the techniques used in each for my Master's thesis. When I reached the point where a lot of papers showed the importance of color in object recognition, I bumped into HLC, MDST and CSS. So my question is: are they all literally a way to describe the distribution of the color in an image? If yes, I would be glad for a brief explanation …
Category: Data Science

Histogram with financial (decimal) amounts vs. normal numeric

Take the following histogram data: This is an item of "bin size" 1 from 0 onwards. However, I do not think this looks appropriate, as every time I have seen a histogram (or someone has requested it), it has unambiguous values, such as: $ 0.00 - $0.99 $ 1.00 - $1.99 etc. However, not even Excel does this correctly, so I was wondering if there was something like a suggested "significant figures" to apply to a histogram so that: (1) …
Category: Data Science

Exploratory statistics: how to identify and remove a driver (bias)

I am looking at customer data, and created frequency tables (+histograms) for customers with different professional statuses and what the best time is to reach them. Status ranges here from employed, retired, self-employed, unemployed, blank. For each of these statuses, I expected some variation in terms of when the best time is to reach each type of customer. Intuitively and from experience e.g. employed people, on average, should be available early in the morning or early evening, while unemployed are …
Category: Data Science

Fitting a pandas dataframe to a Poisson Distribution

I have a simple dataframe df2 that consists of indices and one column of values. I want to fit this dataframe to a Poisson distribution. Below is the code I am using:

    import numpy as np
    from scipy.optimize import curve_fit

    data = df2.values
    bins = df2.index

    def poisson(k, lamb):
        return (lamb^k/ np.math.factorial(k)) * np.exp(-lamb)

    params, cov = curve_fit(poisson, np.array(bins.tolist()), data.flatten())

I get the following error:

    TypeError: only size-1 arrays can be converted to Python scalars
Category: Data Science
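
A possible fix, sketched under some assumptions: `curve_fit` passes the whole array of `k` values to the model function, while a scalar factorial function cannot handle arrays, which is likely the source of the TypeError. In addition, `^` is bitwise XOR in Python, so exponentiation needs `**` (or the log form used below). The example uses `scipy.special.gammaln` for an element-wise log-factorial and a hypothetical stand-in for `df2` whose values are exact Poisson probabilities with rate 3.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import gammaln


def poisson(k, lamb):
    """Poisson pmf evaluated element-wise, so k may be an array."""
    # exp(k*log(lamb) - lamb - log(k!)) avoids overflow in k! for large k.
    return np.exp(k * np.log(lamb) - lamb - gammaln(k + 1))


# Hypothetical stand-in for df2: index = k, values = observed probabilities.
k_values = np.arange(0, 15)
observed = poisson(k_values, 3.0)

# Keep the rate positive during the fit so that log(lamb) stays defined.
params, cov = curve_fit(poisson, k_values, observed, p0=[1.0], bounds=(1e-9, np.inf))
fitted_lambda = params[0]
print(fitted_lambda)
```

If the dataframe holds raw counts rather than probabilities, they would need to be divided by their total before fitting a pmf like this one.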

Re-sampling of a Histogram's Bins

I would like to be able to resample a histogram's bins without having access to the raw data. And just to be clear, by resample, I mean to change the number of bins and still provide a good estimate of the original probabilities of those bins. I can think of many ways to do this, but I am having trouble figuring out which is the best method that maintains the same probability in the resulting histogram. The easy one would be if …
Category: Data Science
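
One of the many possible methods is to interpolate the cumulative counts at the new bin edges, which assumes that the counts are spread uniformly inside each original bin. The sketch below illustrates that assumption only; the function name `rebin_histogram` and the example histogram are hypothetical.

```python
import numpy as np


def rebin_histogram(counts, edges, new_edges):
    """Redistribute bin counts onto new edges, assuming the counts are
    spread uniformly inside each original bin."""
    cumulative = np.concatenate(([0.0], np.cumsum(counts, dtype=float)))
    # Linear interpolation of the cumulative counts at the new edges.
    new_cumulative = np.interp(new_edges, edges, cumulative)
    return np.diff(new_cumulative)


# Hypothetical histogram: 4 bins on [0, 4].
counts = np.array([10, 20, 30, 40])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Resample onto 8 bins of width 0.5 covering the same range.
new_edges = np.linspace(0.0, 4.0, 9)
new_counts = rebin_histogram(counts, edges, new_edges)
print(new_counts)
```

As long as the new edges cover the same range as the old ones, the total count (and therefore the total probability) is preserved.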

Histograms in Machine Learning

I have a large data set with over 100k samples and I want to predict a continuous target feature from 4 other continuous features using scikit-learn. For this project, I would like to visualize and analyze the data using both one-dimensional and two-dimensional histograms. I know how to plot histograms and I know what a histogram means/displays mathematically, but how can I make good use of it in order to analyze my data? One thing that comes …
Category: Data Science

How to force histogram plots to have same axes?

I am comparing my trained model with other benchmark models using the error histogram, but the axes of the histogram are different for each method, as shown in the figure. For instance, to plot the error histogram of every method, I tried this code:

    % Matlab code
    Targets = Actual;
    Outputs = Predicted_by_model;
    errors = Targets - Outputs;
    error_std = std(errors);
    MAPE = mean(abs(Targets - Outputs) ./ Targets) * 100;
    histfit(errors);
    legend('Proposed')
    title(['MAPE = ' num2str(MAPE) ' , Error St.D. = ' num2str(error_std)])

How can I keep the axes of every method the same?
Category: Data Science
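
The general idea is to compute one common range across all methods and reuse it for every histogram. The sketch below shows this in Python with NumPy rather than in MATLAB, and the two error vectors and the choice of 30 bins are hypothetical.

```python
import numpy as np

# Hypothetical error vectors for two methods.
rng = np.random.default_rng(0)
errors_by_method = {
    "Proposed": rng.normal(0.0, 1.0, size=500),
    "Benchmark": rng.normal(0.5, 2.0, size=500),
}

# One symmetric range that covers every method's errors.
limit = max(np.abs(e).max() for e in errors_by_method.values())
shared_edges = np.linspace(-limit, limit, 31)

# Every method is binned on the same edges, so the x-axes are comparable.
histograms = {
    name: np.histogram(e, bins=shared_edges)[0]
    for name, e in errors_by_method.items()
}
print({name: h.sum() for name, h in histograms.items()})
```

In MATLAB, the analogous step would be to apply the same limits to every figure, for example with `xlim` and `ylim`, after computing the common range across all methods.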
