Capture pattern in Python

I would like to capture the following pattern using Python: anyprefix-emp-<employee id>_id-<designation id>_sc-<scale id>

Example data

strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]

Expected Output:

[('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'), ('emp-003', 'id-03', 'sc-10')]

How can I do this in Python?


You can solve this problem with a regular expression:

import re

# Capture the emp, id and sc parts; findall returns one tuple of groups per match.
regex = re.compile(r"(emp-.+)_(id-.+)_(sc-.+)")
strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]
print([regex.findall(s)[0] for s in strings])
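One thing to watch with a pattern like (emp-.+)_(id-.+)_(sc-.+): each .+ is greedy and will happily match underscores too. That makes no difference on the sample data, but it can over-capture if a string ever carries extra fields. Below is a small sketch comparing it with a variant that uses [^_]+ so each group stops at the next underscore. The "_extra" suffix is a made-up input for illustration and is not part of the original data.

```python
import re

# Hypothetical input with an extra trailing field (not in the original data).
s = "humanresourc-emp-001_id-01_sc-01_extra"

# Greedy version: .+ is allowed to run across underscores.
greedy = re.compile(r"(emp-.+)_(id-.+)_(sc-.+)")
# Bounded version: [^_]+ stops each group at the next underscore.
bounded = re.compile(r"(emp-[^_]+)_(id-[^_]+)_(sc-[^_]+)")

greedy_result = greedy.findall(s)[0]
bounded_result = bounded.findall(s)[0]

print(greedy_result)   # ('emp-001', 'id-01', 'sc-01_extra')
print(bounded_result)  # ('emp-001', 'id-01', 'sc-01')
```

On the three strings from the question both patterns give the same tuples, so this only matters if the real data is less tidy than the sample.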

Answer

[tuple(s[s.find("-") + 1:].split("_")) for s in strings]

Explanation

Each string has a nice regular format:

  1. a description
  2. employee number
  3. id number
  4. 'sc' number (the scale id, going by the question)

The description is separated from the rest by a hyphen (-), and the remaining attributes are separated from each other by an underscore (_).

Your result doesn't need the description, so find where the description ends and remove it. I find the first hyphen (-) and keep only everything after it.

Then I split the remaining string into three strings, using split("_").

This returns the three parts you want, which I then put into a tuple.

I perform this for each string in strings.
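To make the steps above concrete, here is the same logic unrolled for a single string from the question. The intermediate variable names are only for illustration.

```python
# Walk through the one-liner for one string, one step at a time.
s = "humanresourc-emp-001_id-01_sc-01"

first_hyphen = s.find("-")        # index of the first hyphen, i.e. where the description ends
remainder = s[first_hyphen + 1:]  # drop the description and that hyphen
parts = remainder.split("_")      # split the rest on underscores
result = tuple(parts)             # pack the three parts into a tuple

print(first_hyphen)  # 12
print(remainder)     # emp-001_id-01_sc-01
print(parts)         # ['emp-001', 'id-01', 'sc-01']
print(result)        # ('emp-001', 'id-01', 'sc-01')
```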

You can put it in a function like this:

def extract_tags(strings):
    result = [tuple(s[s.find("-") + 1:].split("_")) for s in strings]
    return result

Here is the output on your test strings:

[('emp-001', 'id-01', 'sc-01'),
 ('emp-002', 'id-02', 'sc-12'),
 ('emp-003', 'id-03', 'sc-10')]
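A caveat with the find("-") approach: it assumes the description itself never contains a hyphen. If that might not hold, one option is to search for the "-emp-" marker instead of the first hyphen. This is only a sketch under that assumption; the function name extract_tags_by_marker, the marker parameter and the hyphenated sample string are all made up for illustration.

```python
def extract_tags_by_marker(strings, marker="-emp-"):
    """Split off everything after the hyphen that starts `marker`.

    Hypothetical helper: assumes each useful string contains "-emp-".
    Strings without the marker are skipped.
    """
    result = []
    for s in strings:
        pos = s.find(marker)
        if pos == -1:
            continue  # marker not found, nothing to extract
        # pos points at the hyphen before "emp", so start one character later
        result.append(tuple(s[pos + 1:].split("_")))
    return result


# Hypothetical input whose description contains a hyphen.
print(extract_tags_by_marker(["human-resourc-emp-001_id-01_sc-01"]))
# [('emp-001', 'id-01', 'sc-01')]
```

On the original three strings this returns the same tuples as the version that uses the first hyphen.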

Try this:

import re
strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]
new_list = []
# Named groups capture the employee, designation and scale ids.
pattern = r'[a-zA-Z]+?[-]{1}(?P<empid>emp-[0-9]{3})_(?P<desid>id-[0-9]{2})_(?P<sclid>sc-[0-9]{2})'
for test_string in strings:
    m = re.search(pattern, test_string)
    # group() with several names returns a tuple directly
    new_tuple = m.group('empid', 'desid', 'sclid')
    new_list.append(new_tuple)
print(new_list)

Not sure if this gets you exactly what you want, but the regex pattern works on the data provided.

Here is my output:

[('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'), ('emp-003', 'id-03', 'sc-10')]
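Note that re.search returns None when a string does not match, in which case calling m.group would raise an AttributeError. If your real data may contain strings in other formats, a guarded version could look like the sketch below. It uses a lightly simplified form of the pattern above; the function name extract_ids and the non-matching sample string are assumptions for illustration, and skipping bad strings (rather than raising an error) is just one possible choice.

```python
import re

# Same idea as the pattern above, compiled once for reuse.
PATTERN = re.compile(
    r"[a-zA-Z]+-(?P<empid>emp-[0-9]{3})_(?P<desid>id-[0-9]{2})_(?P<sclid>sc-[0-9]{2})"
)


def extract_ids(strings):
    """Hypothetical helper: return one tuple per matching string, skipping the rest."""
    found = []
    for s in strings:
        m = PATTERN.search(s)
        if m is None:
            continue  # no match: skip instead of failing on m.group
        found.append(m.group("empid", "desid", "sclid"))
    return found


# The second string is a made-up example of a non-matching input.
print(extract_ids(["itoperation-emp-002_id-02_sc-12", "no-match-here"]))
# [('emp-002', 'id-02', 'sc-12')]
```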
