Implementation of a distributed data mining paper

Question

Implementation of a distributed data mining paper

Nebula

2018年8月7日 08:52

I have a project about distributed data mining and I need to do some implementations, So I've searched and found this paper. The address of dataset is mentioned in the paper and I've downloaded it. For the process I should split the dataset into 10 smaller datasets. And the other task is using Weka4WS (weka for web services) for the process (For clustering part).

So my questions:
1. How can I split the dataset using python code?
2. What is Weka4WS and how should I use it, I mean the implementation (because it's not something like apps like the Weka itself)?

Topic weka distributed data-mining

Category Data Science

Bruno Lubascher · Accepted Answer · 2018年8月6日 18:16

Splitting your dataset

First you can load your data set to memory:

# Load the data to memory
with open('file.txt', 'r', encoding='utf-8') as file:
    header = file.readline()
    data = []
    for line in file.readlines():
        data.append(line)

Next, you use KFold to split your data into $10$ splits and save to $10$ different files.

# Split the data and save the data
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

split = 1
for _, split_index in kf.split(data): # Iterate over the splits

    with open('file_{}.txt'.format(split), 'w+', encoding='utf-8') as file: # Open a file with name file_{split number}
        file.write(header) # write the header of the file
        for index in split_index: # Write each point
            file.write(data[index])
    split += 1 # Increase split number

Weka4WS

The Weka4WS is an extension to Weka itself. Yes, it is like an application like Weka, with a Graphical User Interface. From the docs:

Thus, Weka4WS also extends the Weka GUI to enable the invocation of the data mining algorithms that are exposed as Web Services on remote Grid nodes.

However, it is now as easy to run as the normal Weka. You need some additional steps to use it.

The official installation is here: http://scalab.dimes.unical.it/weka4ws/howtos/installation/

Here is a tutorial to run the application on Windows: http://scalab.dimes.unical.it/weka4ws/howtos/run-the-client-on-windows/

This is how the app looks like, this is the KnowledgeFlow screen:

Implementation of a distributed data mining paper

Splitting your dataset

Weka4WS

About