Text classification with Weka (unlimited dependent variable values)

Our dataset has two attributes, citizen and nric. The rule is: if citizen is US, the result should be the nric value; otherwise it should be Non-US.

Could you please suggest which algorithm in Weka I should use and, most importantly, how to define this dataset in ARFF format?

Note that nric can be any arbitrary text value. There is no fixed set of values for nric or result.

Train dataset

citizen  nric   result
US       US123  US123
CA       CA332  Non-US
US       US223  US223
US       US776  US776
DE       DE112  Non-US
SG       SG762  Non-US
MM       MM001  Non-US

Test dataset

citizen  nric   result
US       US777  US777
JP       JP919  Non-US
IN       IN010  Non-US

Topic machine-learning-model weka classification

Category Data Science


ML is not the right approach for this task because the task is deterministic, i.e. the result can be computed directly from each instance.

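A simple piece of code like this:

def result(citizen, nric):
    # Deterministic rule from the question: US citizens keep their nric.
    if citizen == 'US':
        return nric
    return 'Non-US'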

is much more efficient and more accurate than an ML classification model.
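
As a quick check, the rule can be run against the test rows from the question. This is only a minimal sketch in Python; the names expected_result, TEST_ROWS and count_mismatches are made up for illustration and are not part of Weka.

```python
def expected_result(citizen: str, nric: str) -> str:
    """Apply the rule from the question: US citizens keep their nric."""
    return nric if citizen == "US" else "Non-US"


# Rows copied from the question's test dataset: (citizen, nric, result).
TEST_ROWS = [
    ("US", "US777", "US777"),
    ("JP", "JP919", "Non-US"),
    ("IN", "IN010", "Non-US"),
]


def count_mismatches(rows) -> int:
    """Count rows whose stated result differs from what the rule produces."""
    return sum(
        1
        for citizen, nric, result in rows
        if expected_result(citizen, nric) != result
    )


if __name__ == "__main__":
    print(count_mismatches(TEST_ROWS))
```

Because the rule reproduces every listed result exactly, including for nric values it has never seen before, there is nothing left for a model to learn from the data.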

An ML model should be used for tasks where there is uncertainty about the result, i.e. where statistical calculations on a large dataset are needed to find the patterns that link the features to the labels.

Also, this is not text classification. That term is used when the input is unstructured text, for instance full sentences. In this case the data is structured.
