Text classification with Weka (unlimited dependent variable values)

Question

Sue Nile

March 8, 2022, 11:52

Our dataset has two attributes, citizen and nric. The rule is: if citizen is US, the result should be the nric value; otherwise it should be Non-US.

Could you please suggest which algorithm in Weka I should use and, most importantly, how to define this dataset in ARFF format?

Note that nric can be any arbitrary text value. There is no fixed set of values for nric or result.

Train dataset

citizen	nric	result
US	US123	US123
CA	CA332	Non-US
US	US223	US223
US	US776	US776
DE	DE112	Non-US
SG	SG762	Non-US
MM	MM001	Non-US

Test dataset

citizen	nric	result
US	US777	US777
JP	JP919	Non-US
IN	IN010	Non-US

Topic machine-learning-model weka classification

Category Data Science

Erwan · Accepted Answer · March 8, 2022, 11:48

ML is not the right approach for this task because the task is deterministic, i.e. it's possible to calculate the result directly from the instance.

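Simple code like this (written here as Python):

def result(citizen, nric):
    # The output is fully determined by the two input fields.
    if citizen == 'US':
        return nric
    else:
        return 'Non-US'

is much more efficient and more accurate than an ML classification model.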
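
To make the accuracy point concrete, here is a minimal Python sketch that applies the rule to the three test rows from the question and counts how many results match. The function name `apply_rule` and the variable names are hypothetical, chosen only for illustration.

```python
# Hypothetical helper: the rule from the question as a plain function.
def apply_rule(citizen: str, nric: str) -> str:
    return nric if citizen == "US" else "Non-US"

# Test rows copied from the question: (citizen, nric, expected result).
test_rows = [
    ("US", "US777", "US777"),
    ("JP", "JP919", "Non-US"),
    ("IN", "IN010", "Non-US"),
]

# Apply the rule to each row and compare with the expected result.
predictions = [apply_rule(citizen, nric) for citizen, nric, _ in test_rows]
matches = sum(pred == row[2] for pred, row in zip(predictions, test_rows))
print(f"{matches}/{len(test_rows)} rows match")
```

Because the rule copies nric through unchanged, it also handles nric values that never appeared in the training data, which a classifier with a fixed set of class labels cannot do.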

An ML model should be used for tasks where there is uncertainty about the result, i.e. where statistical calculations on a large dataset are needed to find the patterns that link the features to the labels.

Also, this is not text classification. That term is used when the input is unstructured text, for instance full sentences. In this case the data is structured.
