Timing sequence in MapReduce

I'm running tests on the MapReduce algorithm in different environments, such as Hadoop and MongoDB, using different types of data. What are the different methods or techniques for measuring the execution time of a query? If I'm inserting a huge amount of data, say 2-3 GB, how can I measure the time for the process to complete?
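One environment-agnostic approach is to wrap the operation in a wall-clock timer on the client side. Below is a minimal sketch using pymongo (the database, collection, and pipeline are placeholder assumptions); on Hadoop, the job start/finish times reported in the ResourceManager web UI, or simply prefixing the command with the Unix `time` utility, give comparable numbers:

```
import time
from pymongo import MongoClient

# Placeholder connection, database, and collection names.
client = MongoClient("mongodb://localhost:27017")
coll = client["testdb"]["events"]

start = time.perf_counter()
# Any insert, query, or aggregation can be timed the same way.
result = list(coll.aggregate([{"$group": {"_id": "$key", "n": {"$sum": 1}}}]))
elapsed = time.perf_counter() - start
print(f"operation took {elapsed:.3f} s")
```

For bulk loads, timing around insert_many the same way gives the load time for the 2-3 GB case; MongoDB's explain() output and the Hadoop job counters can break the total down further.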
Category: Data Science

Levenshtein distance vs simple for loop

I have recently begun studying different data science principles and have taken a particular interest of late in fuzzy matching. By way of preface, I'd like to include smarter fuzzy searching in a proprietary language named "4D" at my workplace, so access to libraries is pretty much nonexistent. It's also worth noting that the client side is currently single-threaded, so taking advantage of multi-threaded matrix manipulations is out of the question. I began studying the Levenshtein algorithm and got that …
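For reference, the textbook dynamic-programming formulation of Levenshtein needs no libraries and no threads, so it ports directly to a single-threaded environment such as 4D; a minimal Python sketch using two rolling rows:

```
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: O(len(a)*len(b)) time,
    # two rolling rows instead of the full matrix to save memory.
    if len(a) < len(b):
        a, b = b, a          # ensure b is the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```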
Category: Data Science

More efficient way to create frequency column based on different groupings

I have the code below, which calculates a frequency for each column element (relative to its own column) and adds all five frequencies together in a new column. The code works but is very slow, and the majority of the processing time is spent in this function. Any ideas on how to accomplish the same goal more efficiently?

```
Create_Freq <- function(Word_List) {
  library(dplyr)
  Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
    (Word_List %>% add_count(SecLet))[, "n"] +
    (Word_List %>% add_count(ThirdtLet))[, "n"] +
    (Word_List %>% add_count(FourLet))[, "n"] +
    (Word_List %>% add_count(FifthLet))[, "n"]
  return(Word_List)
}
```
Category: Data Science

In between CNN and MLP: neural network architecture for a "close to convolutional" problem?

I am looking to approximate a forward problem (expensive to calculate precisely) using a NN. Input and output are vectors of identical length. Although not linear, the output somewhat resembles a convolution with a kernel, but the kernel is not constant: it varies smoothly with the offset along the vector. I can only provide a limited training set, so I'm looking for a way to exploit this smoothness. Correct me if I'm wrong (I'm completely new to ML/NN), but in …
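To make the structure concrete, here is a small NumPy sketch of the kind of forward map described: a convolution-like operator whose kernel varies smoothly with position along the vector (the Gaussian kernel parametrization is invented purely for illustration):

```
import numpy as np

def smooth_kernel_forward(x, half_width=3):
    # Convolution-like map: each output sample is a weighted sum of a
    # local input window, but the weights drift smoothly with position.
    n = len(x)
    y = np.zeros(n)
    offsets = np.arange(-half_width, half_width + 1)
    for i in range(n):
        # Width of the Gaussian-shaped kernel varies smoothly along the vector.
        sigma = 1.0 + 0.5 * np.sin(2 * np.pi * i / n)
        w = np.exp(-(offsets / sigma) ** 2)
        w /= w.sum()
        idx = np.clip(i + offsets, 0, n - 1)  # clamp at the boundaries
        y[i] = np.dot(w, x[idx])
    return y

y = smooth_kernel_forward(np.random.rand(128))
```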
Category: Data Science

Techniques to increase the evaluation speed of a neural network

This is somewhat of an open-ended question and in some respects a literature request (I would love to be pointed to a survey paper if one exists). Suppose I am constructing a neural network to make some arbitrary prediction (either categorical or numeric; it doesn't matter). With this network I am concerned primarily with the speed of evaluation. Obviously I want the network to give predictions that are as accurate as possible, but I'm more than willing to sacrifice some accuracy if it …
Category: Data Science

Efficient method of performing within matrix similarity

I want to compute a similarity comparison for each entry in a dataset against every other entry that is labeled as class 1 (excluding the current entry if its own label is 1). So, consider a matrix of training data that has columns for ID and class/label, and then a bunch of data columns:

```
ID  Label  var1  var2  var3  ...  varN
1   1      0.26  0.44  0.2        0.11
2   0      0.13  0.34  0.14       0.21
3   1      0.22  0.34  0.45       0.57
…
```
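One vectorized way to do this, as a sketch (assuming the feature columns sit in a NumPy array and cosine similarity is an acceptable metric):

```
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real data: X holds the feature columns, y the labels.
X = np.array([[0.26, 0.44, 0.20, 0.11],
              [0.13, 0.34, 0.14, 0.21],
              [0.22, 0.34, 0.45, 0.57]])
y = np.array([1, 0, 1])

pos = np.where(y == 1)[0]          # indices of class-1 rows
S = cosine_similarity(X, X[pos])   # each row vs. every class-1 row

# Exclude self-comparisons for rows that are themselves class 1.
for col, row in enumerate(pos):
    S[row, col] = np.nan

mean_sim = np.nanmean(S, axis=1)   # e.g. average similarity to class 1
```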
Category: Data Science

Set value for column based on two other columns in pandas dataframe

I have a dataframe of contracts with different order dates, and I need to create a new column that assigns a number to each contract if it has more than one order date. For example, my sample dataframe looks something like this:

```
df = pd.DataFrame({
    'contract': ['123A', '123A', '123A', '123A', '123B', '123B', '123C'],
    'prod': ['X1', 'M1', 'V1', 'D1', 'A1', 'B1', 'C1'],
    'date': ['2019-04-17', '2019-07-02', '2019-04-17', '2019-07-02',
             '2019-04-17', '2019-09-01', '2019-08-02'],
    'revenue': [5688, 113932, 5688, 49157, 5002, 892, 9000],
})
```

I need my final table to have another column with a unique contract id for each date. The final table from the above should look something like this: contract date header_contract 123A …
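One possible approach, sketched below (the exact suffix scheme is a guess at the desired output, since the example is truncated): rank each contract's distinct dates with groupby plus factorize, and only suffix contracts that have more than one date:

```
# df as constructed above.
# Number each distinct date within a contract (1, 2, ...).
df['date_rank'] = df.groupby('contract')['date'].transform(
    lambda s: s.factorize()[0] + 1)

# Only contracts with more than one distinct order date get a suffix.
multi = df.groupby('contract')['date'].transform('nunique') > 1
df['header_contract'] = df['contract'].where(
    ~multi, df['contract'] + '-' + df['date_rank'].astype(str))
```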
Category: Data Science

What is the difference in computational cost at inference time between object detection and semantic segmentation?

I am aware that YOLO (v1-5) is a real-time object detection model with moderately good overall prediction performance. I know that UNet and its variants are efficient semantic segmentation models that are also fast and have good prediction performance. I cannot find any resources comparing the inference speed of these two approaches. It seems to me that semantic segmentation, which classifies every pixel in an image, is clearly a harder problem than object detection, which draws bounding boxes around objects …
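Published head-to-head numbers are scarce, partly because speed depends on the backbone, input resolution, and hardware more than on the task itself. Absent a survey, one can measure directly; a minimal PyTorch timing harness (the single-layer models below are stand-ins — substitute real YOLO and UNet implementations):

```
import time
import torch

@torch.no_grad()
def mean_latency(model, input_shape=(1, 3, 512, 512), warmup=10, runs=50):
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):          # warm-up iterations stabilize timings
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs

# Stand-in networks; replace with real detector / segmenter checkpoints.
detector = torch.nn.Conv2d(3, 16, 3, padding=1)
segmenter = torch.nn.Conv2d(3, 16, 3, padding=1)
print(mean_latency(detector), mean_latency(segmenter))
```

On a GPU, add torch.cuda.synchronize() around the timed region, since kernel launches return before the work actually finishes.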
Category: Data Science

Can I say that a trained neural network model with fewer parameters requires fewer resources during real-world inference?

Let us imagine that we have two trained neural network models with different architectures (e.g., different types of layers). The first model (a) uses 1D convolutional layers together with fully connected layers and has 10 million learnable parameters. The second model (b) uses 2D convolutional layers and has only 1 million parameters in total. Both models achieve equal scores on the same input data set. Can I say that model b, with fewer parameters, is more favourable because it has less …
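Parameter count is easy to read off, but it is only a proxy for inference cost: convolutional layers reuse a small weight set across many positions, so a 1M-parameter 2D CNN can still perform more multiply-accumulates per input than a 10M-parameter model dominated by fully connected weights. A small PyTorch sketch for counting parameters (the toy architectures are illustrative only):

```
import torch.nn as nn

# Toy stand-ins; the real models (a) and (b) would go here.
model_a = nn.Sequential(nn.Conv1d(1, 64, 5, padding=2), nn.ReLU(),
                        nn.Flatten(), nn.Linear(64 * 128, 512))
model_b = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 8, 3, padding=1))

def n_params(model):
    return sum(p.numel() for p in model.parameters())

print(n_params(model_a), n_params(model_b))  # parameter counts only
```

The more reliable comparison is measured latency and peak memory on the target hardware with representative input sizes.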
Category: Data Science

Deep learning on cloud

I am trying to implement some deep learning models with a large amount of data, around 10 gigabytes. However, my laptop and the free tier of Colab crash when trying to load it. Do you think it is worth buying Colab Pro? Do you suggest any other solutions? My worry is that Colab Pro is offered only in the US and Canada, while I am in Europe. Thanks in advance.
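Whichever service you choose, the crash on load is often avoidable by streaming the data instead of reading all 10 GB at once; a sketch with pandas (the file name, chunk size, and column name are placeholders):

```
import pandas as pd

# Process a large CSV without holding it all in memory.
chunk_means = []
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    # Fit incrementally, accumulate statistics, or write features here;
    # "target" is a hypothetical column name.
    chunk_means.append(chunk["target"].mean())

print(sum(chunk_means) / len(chunk_means))
```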
Category: Data Science

Efficiently Sending Two Series to a Function For Strings with an application to String Matching (Dice Coefficient)

I am using a Dice coefficient based function to calculate the similarity of two strings:

```
def dice_coefficient(a, b):
    try:
        if not len(a) or not len(b):
            return 0.0
    except:
        return 0.0
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        return 0.0
    a_bigram_list = [a[i:i+2] for i in range(len(a) - 1)]
    b_bigram_list = [b[i:i+2] for i in range(len(b) - 1)]
    a_bigram_list.sort()
    b_bigram_list.sort()
    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while (i < lena and j …
```
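On the "sending two Series" part of the title: for a pure-Python function like this, a list comprehension over zipped columns is usually faster than DataFrame.apply(axis=1), which pays per-row pandas overhead. A sketch (column names are placeholders; dice_coefficient is the function above, once completed past the truncation):

```
import pandas as pd

df = pd.DataFrame({"name_a": ["martha", "marhta", "jones"],
                   "name_b": ["martha", "martha", "johns"]})

# Element-wise similarity of two string columns; zip avoids the per-row
# Series construction that df.apply(axis=1) would incur.
df["dice"] = [dice_coefficient(a, b)
              for a, b in zip(df["name_a"], df["name_b"])]
```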
Category: Data Science

When is a Model Underfitted?

Logic often states that by underfitting a model, its capacity to generalize is increased. That said, clearly at some point underfitting causes a model to become worse regardless of the complexity of the data. How do you know when your model has struck the right balance and is not underfitting the data it seeks to model? Note: this is a follow-up to my question, "Why Is Overfitting Bad?"
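One standard diagnostic, as a hedged sketch: sweep model capacity and compare training and validation error; where both are poor and close together, the model is underfitting. An example with scikit-learn on a toy dataset:

```
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=500, n_features=20, noise=10.0,
                       random_state=0)

# Sweep regularization strength: large alpha = low capacity (underfit zone).
alphas = np.logspace(-2, 4, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5)

for a, tr, va in zip(alphas, train_scores.mean(1), val_scores.mean(1)):
    print(f"alpha={a:10.2f}  train R^2={tr:.3f}  val R^2={va:.3f}")
# Underfitting shows up where both train and validation scores are low.
```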
Category: Data Science

Memory efficient encoding logic for group categories

I have a huge dataset of categorical data. It comprises alerts having multiple properties. Each alert belongs to a group, and some even belong to multiple groups. It looks somewhat like this:

```
   GroupID       System  State  TimeStamp  etc...
0  [1, 2, 3, 4]  A       REC    ...
1  [1, 2, 3, 4]  A       SNT    ...
2  [2, 4]        B       REC
3  [2, 4]        B       PND
4  [2, 4]        B       COM
5  [2, 4]        B       SNT
6  [2]           C       RCV
7  …
```
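For the multi-group membership specifically, one memory-friendly option (a sketch, assuming the GroupID column holds Python lists) is a sparse multi-hot encoding:

```
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"GroupID": [[1, 2, 3, 4], [1, 2, 3, 4], [2, 4], [2]]})

# sparse_output=True stores the 0/1 membership matrix in compressed sparse
# form, which stays small even with many groups and many alerts.
mlb = MultiLabelBinarizer(sparse_output=True)
membership = mlb.fit_transform(df["GroupID"])
print(mlb.classes_)          # [1 2 3 4]
print(membership.toarray())  # dense view, for illustration only
```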
Category: Data Science

Ways to speed up Python code for data science purposes

Although it might sound like a purely techie question, I would like to know which approaches you usually try out for very data-science-like processes when you need to speed them up (given that data retrieval is not a problem, everything fits in memory, etc.). Some of these could be the following, but I would like to receive feedback about any others: good practices such as always using NumPy when possible for numeric operations and …
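As a concrete baseline for the NumPy point: replacing a Python-level loop with a single vectorized call is often a one-to-two-orders-of-magnitude win on numeric work. A small sketch:

```
import timeit
import numpy as np

x = np.random.rand(1_000_000)

def loop_sum_sq():
    total = 0.0
    for v in x:          # pure-Python loop: one interpreter step per element
        total += v * v
    return total

def numpy_sum_sq():
    return float(np.dot(x, x))  # vectorized: one BLAS call, no Python loop

print(timeit.timeit(loop_sum_sq, number=3))
print(timeit.timeit(numpy_sum_sq, number=3))
```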
Category: Data Science

How can I calculate efficiency for predictive models based on accuracy or error over time?

I was wondering if I could express the efficiency of prognostic models in terms of their accuracy (error, e.g. MAPE or MSE) over time [sec]. So let's imagine I have the following results for different predictive models:

```
model  MSE   MAE   MAPE    predicting time [sec]
LSTM   0.12  0.13  15.67%  456789
GRU    0.06  0.05   5.89%  688741
RNN    0.45  0.51  25.33%   55555
```

What is the best way to illustrate the efficiency of the predictive models over predicting time? Is the following equation right? How about its unit …
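One common way to present it, sketched below with the table's numbers, is a scatter of error against prediction time; the trade-off stays visible without collapsing it into a single ratio whose unit (e.g., %/sec) is hard to interpret:

```
import matplotlib.pyplot as plt

models = ["LSTM", "GRU", "RNN"]
mape = [15.67, 5.89, 25.33]   # %
t = [456789, 688741, 55555]   # seconds

plt.scatter(t, mape)
for m, xi, yi in zip(models, t, mape):
    plt.annotate(m, (xi, yi))
plt.xscale("log")             # times span an order of magnitude
plt.xlabel("predicting time [s]")
plt.ylabel("MAPE [%]")
plt.title("Error vs. prediction time")
plt.show()
```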
Category: Data Science

Finding synergies among observations of equal length

Assume we have a set $I$ with 20 different items (we call them $I_0$, $I_1$ up to $I_{19}$). We also have $n$ observations $O \in I^{n\times 8}$; so each observation is a subset of $I$ with exactly 8 items and is labeled with a score. Just as an illustration, here are some made-up observations with their scores: $O_1=\{I_0, I_8, I_9, I_{10}, I_{14}, I_{15}, I_{16}, I_{17}\};s_1=0.995$ $O_2=\{I_0, I_1, I_2, I_3, I_4, I_5, I_6, I_7\};s_2=0.667$ $O_3=\{I_2, I_3, I_9, I_{15}, I_{16}, I_{17}, …
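A common first step for this kind of problem (a sketch of one possible framing, not necessarily the intended analysis) is to encode each observation as a 20-dimensional 0/1 vector and regress the scores on it; extending the design matrix with pairwise product columns would then expose synergies as interaction coefficients:

```
import numpy as np
from sklearn.linear_model import LinearRegression

n_items = 20
# The two example observations from above; real use needs many more rows,
# otherwise the fit is under-determined.
observations = [
    ({0, 8, 9, 10, 14, 15, 16, 17}, 0.995),
    ({0, 1, 2, 3, 4, 5, 6, 7}, 0.667),
]

# One row per observation, one 0/1 column per item.
X = np.zeros((len(observations), n_items))
y = np.array([score for _, score in observations])
for row, (items, _) in enumerate(observations):
    X[row, list(items)] = 1.0

coef = LinearRegression().fit(X, y).coef_  # per-item additive effect estimates
```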
Topic: efficiency
Category: Data Science

RIMS declining efficiency

For many years I was getting efficiency of about 81% (after backing into the numbers with Brewers Friend) doing simple non-recirculated infusions at 2 qt per lb with a continuous fly sparge. About a year ago I built an electric RIMS (240 volt, 5500 watt, PID-controlled) system so that I could do some step mashes and manage temperatures better. I like the setup, but my efficiency has gone down to 61% with continuous recirculation and the RIMS firing as needed …
Category: Mac

What Calcium ppm is required in the mash for alpha-amylase stability and mash efficiency?

I recently moved and my water here is quite soft with a very low pH. As such, I've taken to adding many of my brewing salts in the brew kettle and only using salts in the mash to balance the pH. However, with a very low pH in the water to begin with, brewing a very dark beer means I'm going to leave as much calcium out of the mash as possible (because calcium lowers the pH). I do add …
Category: Mac
