Is there any way to analyze the format of text strings?

Question

cosmarchy

May 7, 2021 05:24

I have a lot of data which basically consists of alphanumeric text on individual lines, which can vary in length and contain delimiters.

Since there are many thousands of lines of text, I'm looking to see whether there is an automated way to determine the different formats of text.

A sample of which is:

90665013-163
90731046-103
90840069-009
90847069-009
90880046-103
90889046-103
90897-051
9089744-103
9089844-103
90901-46909
90901-lep
9091046-103
9091046-909
90764046-1037
can10043E
can90065-op016
9094344-103
90669j4-4438718
90666ie79
90664046-103
90710-077
004-919
4A1900935
can90064-op016
can90066-E016
9094544-103
9094646-103
4A1900597
4A1900588
4A9443198
4A94431

So, from this sample, we can see that several lines have the format of eight digits, a dash, and then three digits. Another format is three digits, a dash, and then another three digits.

I'm looking for a way to determine all of the unique formats available...

I'm not sure whether this is possible, whether through some code, an online web service, or perhaps a feature of some software, but I'm asking on the off chance that it actually is.

Topic data-analysis data-formats

Category Data Science

Abhishek Verma · Accepted Answer · May 7, 2021 05:24

A machine learning approach to this would be to create a character-level encoding of the data. Then you can run K-Means (using the silhouette score to judge the fit) or Louvain community detection (which doesn't need the number of clusters as input). Finally, you can look at the cluster heads to read off the formats.
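As a rough illustration of the encode-then-cluster idea, here is a minimal standard-library sketch. Several things in it are assumptions rather than part of the answer: the encoding one-hot encodes each position by character class (digit, letter, other) instead of by the exact character, the helper names `encode`, `sq_dist`, and `kmeans` are made up for this example, and `k=4` is simply a guess. In practice you would try several values of `k` and compare silhouette scores, as suggested above.

```python
import random

def encode(text, width):
    """One-hot encode each position by character class, zero-padded to `width`.

    Assumed classes: digit, letter, anything else (e.g. a dash).
    """
    vec = []
    for i in range(width):
        slot = [0.0, 0.0, 0.0]
        if i < len(text):
            ch = text[i]
            slot[0 if ch.isdigit() else 1 if ch.isalpha() else 2] = 1.0
        vec.extend(slot)
    return vec

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centres drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# A few lines taken from the sample in the question.
sample = ["90665013-163", "90731046-103", "90897-051", "90710-077",
          "004-919", "4A1900935", "4A1900597", "can10043E"]
width = max(len(s) for s in sample)
points = [encode(s, width) for s in sample]
labels = kmeans(points, k=4)  # k is a guess, not a tuned value

clusters = {}
for s, lab in zip(sample, labels):
    clusters.setdefault(lab, []).append(s)
for lab, members in sorted(clusters.items()):
    print(lab, members)
```

Because strings with the same class pattern encode to identical vectors, they always end up in the same cluster; what the clustering adds is grouping of near-identical patterns, such as codes that differ only slightly in length.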
