Generation of medical institution names: training corpora?

My question is quite similar to this one: Generation of institution names. I need to produce 'fake' names of medical institutions, specifically to create data for unit tests. Unfortunately, simple tools like Faker do not work well for this task, so I am interested in a more sophisticated solution, possibly involving one or more NER models. Where can I get text corpora for training such a model? The texts must contain (human-)recognizable names of medical institutions, preferably in several languages. I have seen hints that this might be done by scraping PubMed or other web sources. Are there any concrete examples or how-tos?

Topic text-generation named-entity-recognition python

Category Data Science


I have found a way to build a corpus of candidate medical institution names by querying the NCBI RESTful server (the E-utilities), following the description in this link.

First, you send an ESearch request containing some search criteria (e.g. 'radiology', 'dicom', 'segmentation', or whatever). In response you obtain an XML document with a list of PubMed IDs.
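As a rough illustration of this first step, here is a minimal sketch that builds an ESearch URL and pulls the IDs out of the XML response. The endpoint and the `db`, `term` and `retmax` parameters are taken from the public E-utilities documentation as I understand it; the helper names and the sample response (including its IDs) are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed E-utilities endpoint for ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=100):
    """Return an ESearch URL that queries PubMed for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_pubmed_ids(xml_text):
    """Extract the PubMed IDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def search_pubmed(term, retmax=100):
    """Run the search over the network (only when called explicitly)."""
    with urllib.request.urlopen(build_esearch_url(term, retmax)) as response:
        return parse_pubmed_ids(response.read())


# Invented sample response, shaped like an ESearch result.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

print(parse_pubmed_ids(SAMPLE_RESPONSE))
```

Keeping the parsing separate from the HTTP call makes it possible to test the parser offline; `search_pubmed("radiology")` would perform the real request.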

Then you can send an EFetch request containing the IDs, and in response you will obtain an XML document with a tag for each author. That data can then be used to build the corpus (as a first approximation).
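The second step could look roughly like the sketch below. It assumes that the per-author data of interest sits in `Affiliation` elements of the PubMed XML (that is my reading of the format, so please check it against a real response), and the sample document is invented. The function names are hypothetical.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed E-utilities endpoint for EFetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pubmed_ids):
    """Return an EFetch URL requesting the given PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pubmed_ids), "retmode": "xml"}
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def extract_affiliations(xml_text):
    """Collect distinct affiliation strings, in order of first appearance."""
    root = ET.fromstring(xml_text)
    seen = {}
    # Assumption: affiliations are stored in <Affiliation> elements.
    for node in root.iter("Affiliation"):
        text = (node.text or "").strip()
        if text:
            seen.setdefault(text, None)
    return list(seen)


# Invented sample, shaped like an EFetch result with two authors.
SAMPLE_RECORDS = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName>
            <AffiliationInfo>
              <Affiliation>Example University Hospital, Exampletown.</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author>
            <LastName>Roe</LastName>
            <AffiliationInfo>
              <Affiliation>Example University Hospital, Exampletown.</Affiliation>
            </AffiliationInfo>
            <AffiliationInfo>
              <Affiliation>Example Institute of Medical Imaging.</Affiliation>
            </AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

for affiliation in extract_affiliations(SAMPLE_RECORDS):
    print(affiliation)
```

Real affiliation strings tend to mix department, institution, city and sometimes an e-mail address, so they would still need cleaning before they are usable as institution names.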


I can think of a couple of options to collect a sample of medical institutions:

  • Wikipedia has a list of hospitals by country (isn't Wikipedia amazing?)

  • Many countries have some kind of national directory of medical institutions, but that would probably be difficult to scrape and specific to each country.

  • UMLS has a category ("semantic type") for "Health Care Related Organization" (T093, see here), which means that a list of such organizations can be collected directly from the UMLS data. I thought this would be a good option, but I did a quick test and it appears to contain only generic names of departments and facility types, no proper nouns, for example:

    Community occupational therapy clinic
    Abortion Center
    Area Health Education Center
    

Given that UMLS is closely related to PubMed, my guess is that it's not a very good direction, but maybe I didn't dig deep enough. Fair warning: Processing the whole PubMed/PMC is quite a lot of work in my experience.
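For anyone who wants to repeat the UMLS test, here is a minimal sketch. It assumes a UMLS release in Rich Release Format, where `MRSTY.RRF` maps concept IDs (CUIs) to semantic types and `MRCONSO.RRF` holds the concept names, both pipe-delimited; the column positions are my reading of the UMLS reference manual and should be double-checked. The sample rows use made-up identifiers, and the function names are hypothetical.

```python
def cuis_with_semantic_type(mrsty_lines, tui="T093"):
    """Return the set of CUIs tagged with the given semantic type (TUI)."""
    cuis = set()
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        # Assumed MRSTY layout: CUI in column 0, TUI in column 1.
        if len(fields) > 1 and fields[1] == tui:
            cuis.add(fields[0])
    return cuis


def names_for_cuis(mrconso_lines, cuis, language="ENG"):
    """Return the sorted distinct names of the given CUIs in one language."""
    names = set()
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        # Assumed MRCONSO layout: CUI in column 0, language in column 1,
        # name string in column 14.
        if len(fields) > 14 and fields[0] in cuis and fields[1] == language:
            names.add(fields[14])
    return sorted(names)


# Invented rows that only mimic the file layout.
SAMPLE_MRSTY = [
    "C0000001|T093|A0.0.0|Health Care Related Organization|AT0000001||",
    "C0000002|T047|B0.0.0|Disease or Syndrome|AT0000002||",
    "C0000003|T093|A0.0.0|Health Care Related Organization|AT0000003||",
]
SAMPLE_MRCONSO = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|0001|Area Health Education Center|0|N||",
    "C0000002|ENG|P|L0000002|PF|S0000002|Y|A0000002||||SRC|PT|0002|Example Disease|0|N||",
    "C0000003|ENG|P|L0000003|PF|S0000003|Y|A0000003||||SRC|PT|0003|Abortion Center|0|N||",
]

organization_cuis = cuis_with_semantic_type(SAMPLE_MRSTY)
print(names_for_cuis(SAMPLE_MRCONSO, organization_cuis))
```

With a real release, both functions can be fed open file handles (e.g. `open("MRSTY.RRF", encoding="utf-8")`) instead of the sample lists, since they only iterate over lines.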
