How to save a Hugging Face fine-tuned model using PyTorch and distributed training

I am fine-tuning a masked language model based on XLM-RoBERTa large on Google machine specs. When I copy the model from the container to a GCP bucket using gsutil and subprocess, it gives me an error. Versions: torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 transformers==4.17.0. I am using a pre-trained Hugging Face model. I launch it as a train.py file, which I copy inside a Docker image, and use Vertex AI (GCP) to launch it using a ContainerSpec: machineSpec = MachineSpec(machine_type="a2-highgpu-4g", accelerator_count=4, accelerator_type="NVIDIA_TESLA_A100") and python -m torch.distributed.launch --nproc_per_node 4 train.py --bf16. I am …
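For framing, the usual pattern with torch.distributed is to write the checkpoint from the rank-0 process only, then copy it out. A minimal sketch, assuming the default process group is initialized and model is the XLM-R model wrapped in DistributedDataParallel (the output path is a placeholder):

    import torch.distributed as dist

    if dist.get_rank() == 0:
        # Unwrap DistributedDataParallel so the raw Hugging Face model is saved.
        to_save = model.module if hasattr(model, "module") else model
        to_save.save_pretrained("/tmp/model")      # path is an assumption
        tokenizer.save_pretrained("/tmp/model")
    dist.barrier()  # other ranks wait until the save completes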
Category: Data Science

Use distribution probability as a feature in ML model

I built an LSTM model to predict sick cows. I also have risk factors like cow size and height (static risk factors) that I want to combine into the ML model. I found that size is geometrically distributed. My question is how to insert it as a feature into the model. I know that $P(X=k) = p\,q^{k-1}$, but I don't know how to combine it as a feature. Thank you.
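One concrete reading of the question: the geometric PMF can be turned into a numeric column directly. A minimal sketch, assuming df is a pandas DataFrame with the size values in a hypothetical "size" column, and estimating $p$ by its maximum-likelihood value $1/\bar{k}$:

    from scipy.stats import geom

    # MLE of the geometric parameter: p = 1 / mean(k).
    p = 1.0 / df["size"].mean()
    # P(X = k) = p * (1 - p)**(k - 1), added as an ordinary numeric feature.
    df["size_prob"] = geom.pmf(df["size"], p)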
Category: Data Science

Distributed training with low level Tensorflow API

I am using low-level TensorFlow APIs for my model training. By low level I mean that I define the tf.Session() object for the graph and evaluate the graph within this session. I would like to distribute the model training using tf.distribute.MirroredStrategy(). I am able to use MirroredStrategy() with the TensorFlow Sequential API using the example shared by TensorFlow in their documentation, but I am facing difficulty executing low-level TF code with the mirrored strategy. I tried to use …
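For comparison, the documented way to keep a hand-written training step under MirroredStrategy is the TF2 custom-training-loop pattern: create variables inside strategy.scope() and dispatch the step with strategy.run. A minimal sketch (the model and dataset are placeholders):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
        optimizer = tf.keras.optimizers.SGD(0.01)
        loss_fn = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.NONE)

    @tf.function
    def train_step(dist_inputs):
        def step_fn(inputs):
            x, y = inputs
            with tf.GradientTape() as tape:
                per_example = loss_fn(y, model(x))
                loss = tf.nn.compute_average_loss(per_example)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            return loss
        per_replica = strategy.run(step_fn, args=(dist_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

    # Usage: iterate over strategy.experimental_distribute_dataset(dataset)
    # and call train_step(batch) for each distributed batch.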
Category: Data Science

Speed of training decreases when adding more GPUs

I am using distributed TensorFlow with MirroredStrategy. I am training VGG16 based on a custom Estimator. However, as I increase the number of GPUs, the training time increases. As far as I can check, GPU utilization is about 100% and the input function seems able to feed data to the GPUs. Since all the GPUs are in a single machine, is there any clue for finding out the problem? This is the computation graph, and I am wondering whether the Groups_Deps cause the …
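One configuration issue worth ruling out (an assumption about the setup, not a diagnosis): if the global batch size stays fixed while replicas are added, each GPU gets a smaller slice and synchronization overhead dominates. A sketch of scaling the batch with the replica count:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    PER_GPU_BATCH = 64  # assumed per-device batch size
    global_batch = PER_GPU_BATCH * strategy.num_replicas_in_sync

    def input_fn():
        # File name is a placeholder; the point is batching by the global size.
        ds = tf.data.TFRecordDataset("train.tfrecord")
        return ds.batch(global_batch).prefetch(tf.data.experimental.AUTOTUNE)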
Category: Data Science

Combining CNNs for image classification

I would like to take the output of an intermediate layer of a CNN (layer G) and feed it to an intermediate layer of a wider CNN (layer H) to complete the inference. Challenge: The two layers G, H have different dimensions and thus it can't be done directly. Solution: Use a third CNN (call it r) which will take as input the output of layer G and output a valid input for layer H. Then both the weights of …
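To make the adapter r concrete (all shapes here are assumptions for illustration): if layer G emits a (B, 64, 28, 28) tensor and layer H expects (B, 128, 14, 14), a small trainable convolution can bridge the two while both host CNNs stay frozen:

    import torch.nn as nn

    class Adapter(nn.Module):
        """Maps layer G's output to a valid input for layer H (hypothetical shapes)."""
        def __init__(self):
            super().__init__()
            # A stride-2 conv halves 28x28 to 14x14; channels go 64 -> 128.
            self.proj = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

        def forward(self, g_out):
            return self.proj(g_out)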
Category: Data Science

Horovod vs tf.distribute.MirroredStrategy

I am exploring distributed computing using Horovod and tf.distribute.MirroredStrategy. I have a machine which has two GPUs. Using the basic MNIST code (provided in the TF documents), I tried to utilize both GPUs for training. I'm running exactly the same piece of code and am confused by the usage of hardware resources. When I'm using Horovod: [0] GeForce RTX 2080 Ti | 43'C, 13 % | 849 / 11019 MB | vipin(841M) gdm(4M) [1] GeForce RTX 2080 Ti | 40'C, 14 % …
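The two frameworks place work differently, which often explains uneven readings like these: Horovod launches one process per GPU and each process must be pinned to its device, while MirroredStrategy is a single process that mirrors variables across all visible GPUs. A minimal Horovod pinning sketch (the rest of the training script is assumed):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    # Pin this process to exactly one GPU so both workers don't pile onto device 0.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")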
Category: Data Science

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

    def kmeans(data, k, c=None):
        if c is not None:
            centroids = c
        else:
            centroids = []
            centroids = randomize_centroids(data, centroids, k)

        old_centroids = [[] for i in range(k)]
        iterations = 0
        while not (has_converged(centroids, old_centroids, iterations)):
            iterations += 1
            clusters = [[] for i in range(k)]
            # assign data points to clusters
            clusters = euclidean_dist(data, centroids, clusters)
            …
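On the distribution side, one common route is to express a single k-means iteration as map/reduce steps so each node handles only a shard of the points. A PySpark sketch of one iteration (Spark used here as a stand-in for plain Hadoop streaming; data and centroids are assumed to be lists of coordinate tuples):

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    points = spark.sparkContext.parallelize(data)

    def closest(p, centroids):
        return int(np.argmin([np.sum((np.array(p) - np.array(c)) ** 2)
                              for c in centroids]))

    # Map: tag each point with its nearest centroid. Reduce: sum points per cluster.
    sums = (points.map(lambda p: (closest(p, centroids), (np.array(p), 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    new_centroids = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()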
Category: Data Science

Large Graphs: NetworkX distributed alternative

I have built some implementations using NetworkX (graph Python module) native algorithms, in which I output some attributes that I then use for classification purposes. I want to scale this to a distributed environment. I have seen many approaches like Neo4j, GraphX, and GraphLab. However, I am quite new to this, so I want to ask which of them would make it easy to apply graph algorithms (e.g. node centrality measures), preferably using Python. To be more specific, which available option is …
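As one data point on the Python side: GraphFrames runs graph algorithms over Spark DataFrames from Python. A minimal sketch, assuming a running Spark session and vertices/edges DataFrames with the columns GraphFrames expects (id, src, dst):

    from graphframes import GraphFrame

    g = GraphFrame(vertices, edges)
    # PageRank as a stand-in for a centrality measure, computed across the cluster.
    ranks = g.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("id", "pagerank").show()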
Category: Data Science

CountVectorizer vs HashingVectorizer for text

I'd like to tokenize a column of my training data (word-wise n-grams), but I'm working with a very large dataset distributed across a compute cluster. For this use case, CountVectorizer doesn't work well because it requires maintaining a vocabulary state and thus can't be parallelized easily. Instead, for distributed workloads, I read that I should use a HashingVectorizer. My issue is that there are no generated labels now. Throughout training and at the end, I'd like to see which words …
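For context, scikit-learn's stateless vectorizer is HashingVectorizer; because it keeps no vocabulary, every worker can transform its own shard independently, though the feature-to-word mapping is lost by design. A minimal sketch:

    from sklearn.feature_extraction.text import HashingVectorizer

    # No fitted state: each worker can build an identical vectorizer and
    # transform its own partition of the corpus.
    vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20,
                            alternate_sign=False)
    X = vec.transform(["an example document", "another shard of text"])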
Category: Data Science

Federated learning - share of ROI

I am reading about federated learning and have a quick question. 1) I know that in federated learning, the model updates are shared with a central server. 2) All the parties involved in FL can generate benefits because their model has seen more variation in data (due to the different parties involved). But my question is: let's say Site A contributes/has 80% of the data (more data points) and Site B has only 20% (fewer data points). So we know that in …
Category: Data Science

Pytorch Distributed Computing - Recommendations/Resources/Courses?

I would like to get into some distributed computing for processing Pytorch CNN models. I am completely fresh in this field and want to get some recommendations as to where I should start researching and learning techniques in distributed computing specifically for Deep Learning. My motivation is that I have access to a lot of personal Windows 10 Desktops with great hardware, a few Ubuntu Linux machines of my own and then my personal desktop that is rigged with great …
Category: Data Science

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark compared to Hadoop.
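The textbook example of such a problem is an iterative algorithm, where Spark's in-memory caching avoids MapReduce's per-iteration disk round-trip. A hedged sketch (parse, grad, w0, and the HDFS path are all placeholders):

    from pyspark import SparkContext

    sc = SparkContext()
    # cache() keeps the parsed points in cluster memory, so each of the ten
    # gradient-descent passes below avoids re-reading from HDFS.
    points = sc.textFile("hdfs:///data/points.txt").map(parse).cache()
    w = w0
    for _ in range(10):
        w -= 0.1 * points.map(lambda p: grad(w, p)).reduce(lambda a, b: a + b)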
Category: Data Science

Python distributed machine learning

I occasionally train neural nets for my research, and they usually take quite a long time to run (especially when I'm working on my laptop). I'm looking for a way to build the model on any computer and send it up to a server for training and have it return the graphs/accuracies/weights etc. I know there are paid solutions for this but I'm looking for a distributed solution I can run myself. I have a server set up at home …
Category: Data Science

Who is actually sharing physical RAM in a distributed system that has virtual shared memory? (Server and/or clients.)

There is a business with about 100 computers used by employees, and one high-powered server. It's called a "distributed system" by the system architect. It uses Distributed Shared Memory (DSM). There's also middleware, and the server is hosting Virtual Machines (VMs) which are running the applications that the employees see. The question is: does the DSM come from physical memory that the server is sharing, creating virtual shared memory, or does the memory come from those 100 computers (or both)? …
Category: Data Science

What is the difference between Pytorch's DataParallel and DistributedDataParallel?

I am going through this ImageNet example, and in line 88, the module DistributedDataParallel is used. When I searched for it in the docs, I couldn't find anything; however, I found the documentation for DataParallel. So, I would like to know what the difference is between the DataParallel and DistributedDataParallel modules.
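To make the contrast concrete: both wrappers live in torch.nn. DataParallel splits each batch across GPUs inside a single process, while DistributedDataParallel runs one process per GPU and all-reduces gradients during backward(). A minimal sketch (process-group setup is assumed to be handled by a launcher such as torch.distributed.launch; local_rank is a placeholder):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2).cuda()

    # Single process; the batch is scattered to all visible GPUs each forward pass.
    dp_model = nn.DataParallel(model)

    # One process per GPU; gradients are synchronized via all-reduce.
    # torch.distributed.init_process_group(backend="nccl")  # once per process
    # ddp_model = nn.parallel.DistributedDataParallel(model,
    #                                                 device_ids=[local_rank])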
Category: Data Science

Updating Weight Using Updates on Related Data

Suppose $$x = Ay$$ where $x$ is $M \times 1$, $y$ is $N \times 1$, and $A$ is $M \times N$. We have the data $x$ and would like to know what $y$ is. However, the matrix $A$ is too large for a pseudo-inverse, so we would like to approximate the pseudo-inverse $A^{+}$ using machine learning, as it is possible to parallelize it. For parallelization, we divide the given problem into $$x^l = A^l y$$ where $x = [x^1, x^2, \dots, x^L]^T$ …
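One way to see why this row-block split parallelizes (a sketch of the math, not necessarily the intended method): the least-squares gradient decomposes over blocks, $\nabla_y \|x - Ay\|^2 = 2\sum_l (A^l)^T(A^l y - x^l)$, so each worker can compute its own term independently. A NumPy sketch with blocks, N, and the step size as assumed names:

    import numpy as np

    # blocks: list of (A_l, x_l) pairs, one per worker (assumed given).
    y = np.zeros(N)
    for _ in range(1000):
        # Each worker computes its partial gradient; the sum is the full
        # gradient of ||x - A y||**2 (up to the constant factor 2).
        g = sum(A_l.T @ (A_l @ y - x_l) for A_l, x_l in blocks)
        y -= 1e-3 * g  # fixed step size, an assumption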
Category: Data Science

Why can distributed deep learning provide higher accuracy (lower error) than non-distributed training in the following cases?

Based on some papers which I read, distributed deep learning can provide faster training time. In addition, it also provides better accuracy or lower prediction error. What are the reasons? Question edited: I am using TensorFlow to run distributed deep learning (DL) and compare the performance with non-distributed DL. I use a dataset of 1000 samples and a step size of 10000. The distributed DL uses 2 workers and 1 parameter server. Then, the following cases are considered when running the …
Category: Data Science

Understanding how distributed PCA works

As part of a big data analysis project I'm working on, I need to perform PCA on some data using a cloud computing system. In my case, I'm using Amazon EMR for the job, and Spark in particular. Leaving the "how to perform PCA in Spark" question aside, I want to get an understanding of how things work behind the scenes when it comes to calculating PCs on a cloud-based architecture. For example, one of the means to determine the PCs of a dataset is to calculate the covariance matrix …
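The part that distributes cleanly is exactly that covariance step: $X^T X$ is a sum of per-row outer products, so each partition computes a partial sum and only a small $d \times d$ matrix travels to the driver, where the eigen-decomposition runs locally. A PySpark sketch, assuming rdd holds NumPy row vectors:

    import numpy as np

    # Executors: each partition sums the outer products of its own rows.
    partial = rdd.mapPartitions(lambda rows: [sum(np.outer(r, r) for r in rows)])
    gram = partial.reduce(lambda a, b: a + b)   # small d x d matrix
    # Driver: eigen-decomposition of the aggregated matrix.
    eigvals, eigvecs = np.linalg.eigh(gram)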
Category: Data Science

Implementation of a distributed data mining paper

I have a project about distributed data mining and I need to do some implementations, so I searched and found this paper. The address of the dataset is mentioned in the paper, and I've downloaded it. For the process, I should split the dataset into 10 smaller datasets. The other task is using Weka4WS (Weka for web services) for the process (for the clustering part). So my questions: 1. How can I split the dataset using Python code? 2. What is …
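For question 1, a plain pandas/NumPy split works for any dataset that loads as a table. A sketch with the file name assumed:

    import numpy as np
    import pandas as pd

    # Shuffle once, then cut into 10 roughly equal shards.
    df = pd.read_csv("dataset.csv").sample(frac=1, random_state=42)
    for i, shard in enumerate(np.array_split(df, 10)):
        shard.to_csv(f"shard_{i}.csv", index=False)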
Category: Data Science
