How to save a Hugging Face fine-tuned model using PyTorch and distributed training

I am fine-tuning a masked language model based on XLM-RoBERTa large on Google machine specs. When I copy the model from the container to a GCP bucket using gsutil and subprocess, it gives me an error. Versions: torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 transformers==4.17.0. I am using a pre-trained Hugging Face model. I launch it as a train.py file, which I copy inside a Docker image, and use Vertex AI (GCP) to launch it using a ContainerSpec: machineSpec = MachineSpec(machine_type="a2-highgpu-4g", accelerator_count=4, accelerator_type="NVIDIA_TESLA_A100") and python -m torch.distributed.launch --nproc_per_node 4 train.py --bf16. I am …
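For framing, the usual pattern with torch.distributed is to write the checkpoint from the rank-0 process only, then copy it out. A minimal sketch, assuming the default process group is initialized and model is the XLM-R model wrapped in DistributedDataParallel (the output path is a placeholder):

    import torch.distributed as dist

    if dist.get_rank() == 0:
        # Unwrap DistributedDataParallel so the raw Hugging Face model is saved.
        to_save = model.module if hasattr(model, "module") else model
        to_save.save_pretrained("/tmp/model")      # path is an assumption
        tokenizer.save_pretrained("/tmp/model")
    dist.barrier()  # other ranks wait until the save completes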
Category: Data Science

Use distribution probability as a feature in ML model

I built an LSTM model to predict sick cows. I also have risk factors like cow size and height (static risk factors) that I want to combine into the ML model. I found that size is geometrically distributed. My question is how to insert it as a feature into the model. I know that $P(X=k) = p\,q^{k-1}$, but I don't know how to combine it as a feature. Thank you.
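One concrete reading of the question: the geometric PMF can be turned into a numeric column directly. A minimal sketch, assuming df is a pandas DataFrame with the size values in a hypothetical "size" column, and estimating $p$ by its maximum-likelihood value $1/\bar{k}$:

    from scipy.stats import geom

    # MLE of the geometric parameter: p = 1 / mean(k).
    p = 1.0 / df["size"].mean()
    # P(X = k) = p * (1 - p)**(k - 1), added as an ordinary numeric feature.
    df["size_prob"] = geom.pmf(df["size"], p)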
Category: Data Science

Distributed training with low level Tensorflow API

I am using low-level TensorFlow APIs for my model training. By low level I mean that I define the tf.Session() object for the graph and evaluate the graph within this session. I would like to distribute the model training using tf.distribute.MirroredStrategy(). I am able to use MirroredStrategy() with the TensorFlow Sequential API using the example shared by TensorFlow in their documentation, but I am facing difficulty executing low-level TF code with the mirrored strategy. I tried to use …
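For comparison, the documented way to keep a hand-written training step under MirroredStrategy is the TF2 custom-training-loop pattern: create variables inside strategy.scope() and dispatch the step with strategy.run. A minimal sketch (the model and dataset are placeholders):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
        optimizer = tf.keras.optimizers.SGD(0.01)
        loss_fn = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.NONE)

    @tf.function
    def train_step(dist_inputs):
        def step_fn(inputs):
            x, y = inputs
            with tf.GradientTape() as tape:
                per_example = loss_fn(y, model(x))
                loss = tf.nn.compute_average_loss(per_example)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            return loss
        per_replica = strategy.run(step_fn, args=(dist_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

    # Usage: iterate over strategy.experimental_distribute_dataset(dataset)
    # and call train_step(batch) for each distributed batch.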
Category: Data Science

Speed of training decreases when adding more GPUs

I am using distributed TensorFlow with MirroredStrategy. I am training VGG16 based on a custom Estimator. However, as I increase the number of GPUs, the training time increases. As far as I can check, GPU utilization is about 100% and the input function seems able to feed data to the GPUs. Since all the GPUs are in a single machine, is there any clue for finding out the problem? This is the computation graph, and I am wondering whether the Groups_Deps cause the …
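One configuration issue worth ruling out (an assumption about the setup, not a diagnosis): if the global batch size stays fixed while replicas are added, each GPU gets a smaller slice and synchronization overhead dominates. A sketch of scaling the batch with the replica count:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    PER_GPU_BATCH = 64  # assumed per-device batch size
    global_batch = PER_GPU_BATCH * strategy.num_replicas_in_sync

    def input_fn():
        # File name is a placeholder; the point is batching by the global size.
        ds = tf.data.TFRecordDataset("train.tfrecord")
        return ds.batch(global_batch).prefetch(tf.data.experimental.AUTOTUNE)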
Category: Data Science

Combining CNNs for image classification

I would like to take the output of an intermediate layer of a CNN (layer G) and feed it to an intermediate layer of a wider CNN (layer H) to complete the inference. Challenge: The two layers G, H have different dimensions and thus it can't be done directly. Solution: Use a third CNN (call it r) which will take as input the output of layer G and output a valid input for layer H. Then both the weights of …
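To make the adapter r concrete (all shapes here are assumptions for illustration): if layer G emits a (B, 64, 28, 28) tensor and layer H expects (B, 128, 14, 14), a small trainable convolution can bridge the two while both host CNNs stay frozen:

    import torch.nn as nn

    class Adapter(nn.Module):
        """Maps layer G's output to a valid input for layer H (hypothetical shapes)."""
        def __init__(self):
            super().__init__()
            # A stride-2 conv halves 28x28 to 14x14; channels go 64 -> 128.
            self.proj = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

        def forward(self, g_out):
            return self.proj(g_out)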
Category: Data Science

Horovod vs tf.distribute.MirroredStrategy

I am exploring distributed computing using Horovod and tf.distribute.MirroredStrategy. I have a machine which has two GPUs. Using the basic MNIST code (provided in the TF documents), I tried to utilize both GPUs for training. I'm running exactly the same piece of code and am confused by the usage of hardware resources. When I'm using Horovod: [0] GeForce RTX 2080 Ti | 43'C, 13 % | 849 / 11019 MB | vipin(841M) gdm(4M) [1] GeForce RTX 2080 Ti | 40'C, 14 % …
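The two frameworks place work differently, which often explains uneven readings like these: Horovod launches one process per GPU and each process must be pinned to its device, while MirroredStrategy is a single process that mirrors variables across all visible GPUs. A minimal Horovod pinning sketch (the rest of the training script is assumed):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    # Pin this process to exactly one GPU so both workers don't pile onto device 0.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")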
Category: Data Science

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

    def kmeans(data, k, c=None):
        if c is not None:
            centroids = c
        else:
            centroids = []
            centroids = randomize_centroids(data, centroids, k)

        old_centroids = [[] for i in range(k)]
        iterations = 0
        while not (has_converged(centroids, old_centroids, iterations)):
            iterations += 1
            clusters = [[] for i in range(k)]
            # assign data points to clusters
            clusters = euclidean_dist(data, centroids, clusters)
            …
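On the distribution side, one common route is to express a single k-means iteration as map/reduce steps so each node handles only a shard of the points. A PySpark sketch of one iteration (Spark used here as a stand-in for plain Hadoop streaming; data and centroids are assumed to be lists of coordinate tuples):

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    points = spark.sparkContext.parallelize(data)

    def closest(p, centroids):
        return int(np.argmin([np.sum((np.array(p) - np.array(c)) ** 2)
                              for c in centroids]))

    # Map: tag each point with its nearest centroid. Reduce: sum points per cluster.
    sums = (points.map(lambda p: (closest(p, centroids), (np.array(p), 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    new_centroids = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()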
Category: Data Science

Large Graphs: NetworkX distributed alternative

I have built some implementations using NetworkX (graph Python module) native algorithms, in which I output some attributes that I then use for classification purposes. I want to scale this to a distributed environment. I have seen many approaches like Neo4j, GraphX, and GraphLab. However, I am quite new to this, so I want to ask which of them would make it easy to apply graph algorithms (e.g. node centrality measures), preferably using Python. To be more specific, which available option is …
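As one data point on the Python side: GraphFrames runs graph algorithms over Spark DataFrames from Python. A minimal sketch, assuming a running Spark session and vertices/edges DataFrames with the columns GraphFrames expects (id, src, dst):

    from graphframes import GraphFrame

    g = GraphFrame(vertices, edges)
    # PageRank as a stand-in for a centrality measure, computed across the cluster.
    ranks = g.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.select("id", "pagerank").show()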
Category: Data Science

CountVectorizer vs HashingVectorizer for text

I'd like to tokenize a column of my training data (word-wise n-grams), but I'm working with a very large dataset distributed across a compute cluster. For this use case, CountVectorizer doesn't work well because it requires maintaining a vocabulary state and thus can't be parallelized easily. Instead, for distributed workloads, I read that I should use a HashingVectorizer. My issue is that there are no generated labels now. Throughout training and at the end, I'd like to see which words …
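For context, scikit-learn's stateless vectorizer is HashingVectorizer; because it keeps no vocabulary, every worker can transform its own shard independently, though the feature-to-word mapping is lost by design. A minimal sketch:

    from sklearn.feature_extraction.text import HashingVectorizer

    # No fitted state: each worker can build an identical vectorizer and
    # transform its own partition of the corpus.
    vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20,
                            alternate_sign=False)
    X = vec.transform(["an example document", "another shard of text"])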
Category: Data Science

Federated learning - share of ROI

I am reading about federated learning and have a quick question. 1) I know that in federated learning, the model updates are shared with a central server. 2) All the parties involved in FL can generate benefits because their model has seen more variation in data (due to the different parties involved). But my question is: let's say Site A contributes/has 80% of the data (more data points) and Site B has only 20% (fewer data points). So we know that in …
Category: Data Science

Pytorch Distributed Computing - Recommendations/Resources/Courses?

I would like to get into some distributed computing for processing Pytorch CNN models. I am completely fresh in this field and want to get some recommendations as to where I should start researching and learning techniques in distributed computing specifically for Deep Learning. My motivation is that I have access to a lot of personal Windows 10 Desktops with great hardware, a few Ubuntu Linux machines of my own and then my personal desktop that is rigged with great …
Category: Data Science

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark compared to Hadoop.
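The textbook example of such a problem is an iterative algorithm, where Spark's in-memory caching avoids MapReduce's per-iteration disk round-trip. A hedged sketch (parse, grad, w0, and the HDFS path are all placeholders):

    from pyspark import SparkContext

    sc = SparkContext()
    # cache() keeps the parsed points in cluster memory, so each of the ten
    # gradient-descent passes below avoids re-reading from HDFS.
    points = sc.textFile("hdfs:///data/points.txt").map(parse).cache()
    w = w0
    for _ in range(10):
        w -= 0.1 * points.map(lambda p: grad(w, p)).reduce(lambda a, b: a + b)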
Category: Data Science

Python distributed machine learning

I occasionally train neural nets for my research, and they usually take quite a long time to run (especially when I'm working on my laptop). I'm looking for a way to build the model on any computer and send it up to a server for training and have it return the graphs/accuracies/weights etc. I know there are paid solutions for this but I'm looking for a distributed solution I can run myself. I have a server set up at home …
Category: Data Science

Who is actually sharing physical RAM in a distributed system that has virtual shared memory? (Server and/or clients.)

There is a business with about 100 computers used by employees, and one high-powered server. It's called a "distributed system" by the system architect. It uses Distributed Shared Memory (DSM). There's also middleware, and the server is hosting Virtual Machines (VMs) which are running the applications that the employees see. The question is: does the DSM come from physical memory that the server is sharing, creating virtual shared memory, or does the memory come from those 100 computers (or both)? …
Category: Data Science

What is the difference between Pytorch's DataParallel and DistributedDataParallel?

I am going through this ImageNet example, and in line 88, the module DistributedDataParallel is used. When I searched for it in the docs, I couldn't find anything; however, I found the documentation for DataParallel. So, I would like to know what the difference is between the DataParallel and DistributedDataParallel modules.
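To make the contrast concrete: both wrappers live in torch.nn. DataParallel splits each batch across GPUs inside a single process, while DistributedDataParallel runs one process per GPU and all-reduces gradients during backward(). A minimal sketch (process-group setup is assumed to be handled by a launcher such as torch.distributed.launch; local_rank is a placeholder):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2).cuda()

    # Single process; the batch is scattered to all visible GPUs each forward pass.
    dp_model = nn.DataParallel(model)

    # One process per GPU; gradients are synchronized via all-reduce.
    # torch.distributed.init_process_group(backend="nccl")  # once per process
    # ddp_model = nn.parallel.DistributedDataParallel(model,
    #                                                 device_ids=[local_rank])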
Category: Data Science

Updating Weight Using Updates on Related Data

Suppose $$x = Ay$$ where $x$ is $M \times 1$, $y$ is $N \times 1$, and $A$ is $M \times N$. We have the data $x$ and would like to know what $y$ is. However, the matrix $A$ is too large for a pseudo-inverse, so we would like to approximate the pseudo-inverse $A^{+}$ using machine learning, as it is possible to parallelize it. For parallelization, we divide the given problem into $$x^l = A^l y$$ where $x = [x^1, x^2, \dots, x^L]^T$ …
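One way to see why this row-block split parallelizes (a sketch of the math, not necessarily the intended method): the least-squares gradient decomposes over blocks, $\nabla_y \|x - Ay\|^2 = 2\sum_l (A^l)^T(A^l y - x^l)$, so each worker can compute its own term independently. A NumPy sketch with blocks, N, and the step size as assumed names:

    import numpy as np

    # blocks: list of (A_l, x_l) pairs, one per worker (assumed given).
    y = np.zeros(N)
    for _ in range(1000):
        # Each worker computes its partial gradient; the sum is the full
        # gradient of ||x - A y||**2 (up to the constant factor 2).
        g = sum(A_l.T @ (A_l @ y - x_l) for A_l, x_l in blocks)
        y -= 1e-3 * g  # fixed step size, an assumption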
Category: Data Science

Why can distributed deep learning provide higher accuracy (lower error) than non-distributed training in the following cases?

Based on some papers which I read, distributed deep learning can provide faster training time. In addition, it also provides better accuracy or lower prediction error. What are the reasons? Question edited: I am using TensorFlow to run distributed deep learning (DL) and compare the performance with non-distributed DL. I use a dataset of 1000 samples and a step size of 10000. The distributed DL uses 2 workers and 1 parameter server. Then, the following cases are considered when running the …
Category: Data Science

Understanding how distributed PCA works

As part of a big data analysis project I'm working on, I need to perform PCA on some data using a cloud computing system. In my case, I'm using Amazon EMR for the job, and Spark in particular. Leaving the "how to perform PCA in Spark" question aside, I want to get an understanding of how things work behind the scenes when it comes to calculating PCs on a cloud-based architecture. For example, one of the means to determine the PCs of a dataset is to calculate the covariance matrix …
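The part that distributes cleanly is exactly that covariance step: $X^T X$ is a sum of per-row outer products, so each partition computes a partial sum and only a small $d \times d$ matrix travels to the driver, where the eigen-decomposition runs locally. A PySpark sketch, assuming rdd holds NumPy row vectors:

    import numpy as np

    # Executors: each partition sums the outer products of its own rows.
    partial = rdd.mapPartitions(lambda rows: [sum(np.outer(r, r) for r in rows)])
    gram = partial.reduce(lambda a, b: a + b)   # small d x d matrix
    # Driver: eigen-decomposition of the aggregated matrix.
    eigvals, eigvecs = np.linalg.eigh(gram)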
Category: Data Science

Implementation of a distributed data mining paper

I have a project about distributed data mining and I need to do some implementations, so I searched and found this paper. The address of the dataset is mentioned in the paper, and I've downloaded it. For the process, I should split the dataset into 10 smaller datasets. The other task is using Weka4WS (Weka for web services) for the process (for the clustering part). So my questions: 1. How can I split the dataset using Python code? 2. What is …
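For question 1, a plain pandas/NumPy split works for any dataset that loads as a table. A sketch with the file name assumed:

    import numpy as np
    import pandas as pd

    # Shuffle once, then cut into 10 roughly equal shards.
    df = pd.read_csv("dataset.csv").sample(frac=1, random_state=42)
    for i, shard in enumerate(np.array_split(df, 10)):
        shard.to_csv(f"shard_{i}.csv", index=False)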
Category: Data Science
