How to annotate text documents with meta-data?

Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document:

I saw the company's manager last day.

To be able to extract information from it, it must be annotated with additional data to be less ambiguous. The process of finding such meta-data is not in question, so assume it is done manually. The question is how to store this data so that further analysis on it can be done conveniently and efficiently.

A possible approach is to use XML tags (see below), but it seems too verbose, and maybe there are better approaches/guidelines for storing such meta-data on text documents.

<Person name="John">I</Person> saw the <Organization name="ACME">company</Organization>'s
manager <Time value="2014-5-29">last day</Time>.
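
For reference, here is a minimal sketch of how inline annotations of this kind could be read back with Python's standard library. The wrapping <doc> root element is an assumption added only so that the sentence parses as a single XML document; it is not part of the example above.

```python
import xml.etree.ElementTree as ET

# The annotated sentence, wrapped in a hypothetical <doc> root element.
xml_text = (
    '<doc><Person name="John">I</Person> saw the '
    '<Organization name="ACME">company</Organization>\'s manager '
    '<Time value="2014-5-29">last day</Time>.</doc>'
)

root = ET.fromstring(xml_text)

# Plain text with all tags removed.
plain_text = "".join(root.itertext())

# One (tag, attributes, covered text) triple per annotation element.
annotations = [(el.tag, dict(el.attrib), el.text) for el in root]

print(plain_text)
for tag, attrs, covered in annotations:
    print(tag, attrs, repr(covered))
```

Note that the character positions of each annotation are not stored anywhere in this representation; they would have to be recomputed while walking the tree.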



Try Label Studio. It supports NER tagging of plain text and HTML, and much more.

Input to Label Studio for the task shown in the screenshot (HTML code packed into JSON):

{
    "text": "<div style=\"max-width: 750px\"><div style=\"clear: both\"><div style=\"float: right; display: inline-block; border: 1px solid #F2F3F4; background-color: #F8F9F9; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>Jules</b>: No no, Mr. Wolfe, it's not like that. Your help is definitely appreciated.</p></div></div><div style=\"clear: both\"><div style=\"float: right; display: inline-block; border: 1px solid #F2F3F4; background-color: #F8F9F9; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>Vincent</b>: Look, Mr. Wolfe, I respect you. I just don't like people barking orders at me, that's all.</p></div></div><div style=\"clear: both\"><div style=\"display: inline-block; border: 1px solid #D5F5E3; background-color: #EAFAF1; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>The Wolf</b>: If I'm curt with you, it's because time is a factor. I think fast, I talk fast, and I need you two guys to act fast if you want to get out of this. So pretty please, with sugar on top, clean the car.</p></div></div></div>"
}

Output:

[
    {
        "id": "9fkAdIXgkV",
        "from_name": "ner",
        "to_name": "text",
        "source": "$text",
        "type": "hypertextlabels",
        "value": {
            "start": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
            "end": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
            "text": "Jules",
            "startOffset": 0,
            "endOffset": 5,
            "htmllabels": [
                "Person"
            ]
        }
    },
    {
        "id": "YMeGv8ndLx",
        "from_name": "ner",
        "to_name": "text",
        "source": "$text",
        "type": "hypertextlabels",
        "value": {
            "start": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
            "end": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
            "text": "Wolfe",
            "startOffset": 13,
            "endOffset": 18,
            "htmllabels": [
                "Organization"
            ]
        }
    },
    {
        "id": "vgGGhXRFcr",
        "from_name": "ner",
        "to_name": "text",
        "source": "$text",
        "type": "hypertextlabels",
        "value": {
            "start": "/div[1]/div[2]/div[1]/p[1]/text()[1]",
            "end": "/div[1]/div[2]/div[1]/p[1]/text()[1]",
            "text": " Look, Mr. Wo",
            "startOffset": 1,
            "endOffset": 14,
            "htmllabels": [
                "Person"
            ]
        }
    },
    {
        "id": "oJxIH-ztQv",
        "from_name": "ner",
        "to_name": "text",
        "source": "$text",
        "type": "hypertextlabels",
        "value": {
            "start": "/div[1]/div[2]/div[1]/p[1]/text()[2]",
            "end": "/div[1]/div[2]/div[1]/p[1]/text()[2]",
            "text": "people bar",
            "startOffset": 38,
            "endOffset": 48,
            "htmllabels": [
                "Organization"
            ]
        }
    }
]
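
A rough sketch of how output like this could be consumed in Python. The two entries are abridged from the output above; the function name extract_spans is hypothetical. Judging from the sample, startOffset and endOffset appear to be relative to the text node addressed by the XPath in "start"/"end", not to the whole document.

```python
import json

# Two entries abridged from the output above (XPath fields omitted).
raw = """
[
  {"id": "9fkAdIXgkV", "type": "hypertextlabels",
   "value": {"text": "Jules", "startOffset": 0, "endOffset": 5,
             "htmllabels": ["Person"]}},
  {"id": "YMeGv8ndLx", "type": "hypertextlabels",
   "value": {"text": "Wolfe", "startOffset": 13, "endOffset": 18,
             "htmllabels": ["Organization"]}}
]
"""


def extract_spans(results):
    """Return (label, text, start_offset, end_offset) for each labelled region."""
    spans = []
    for item in results:
        value = item["value"]
        for label in value["htmllabels"]:
            spans.append(
                (label, value["text"], value["startOffset"], value["endOffset"])
            )
    return spans


spans = extract_spans(json.loads(raw))
for span in spans:
    print(span)
```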

Describing all existing data is a very difficult task, but we can use a data model such as http://schema.org/, which defines structured types of information. It was originally aimed at markup, so it seems it could be useful for your task.
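
As an illustration only, the entities from the question's example might be described with schema.org types in JSON-LD roughly like this. "Person", "Organization" and "name" are schema.org terms; grouping the entities under "@graph" is just one possible layout, and nothing here says how the description would be linked back to positions in the text.

```python
import json

# Hypothetical JSON-LD description of the entities in the example sentence.
entities = {
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Person", "name": "John"},
        {"@type": "Organization", "name": "ACME"},
    ],
}

serialized = json.dumps(entities, indent=2)
print(serialized)
```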


Personally, I would advocate using something that is not specific to the NLP field and that is general enough to remain useful as a tool even once you have moved beyond this level of metadata. I would especially pick a format that can be used regardless of development environment and that can keep some basic structure (like tokenization) if that becomes relevant.

It might seem strange, but I would honestly suggest JSON. It's extremely well supported, supports a lot of structure, and is flexible enough that you shouldn't have to move from it for not being powerful enough. For your example, something like this:

{"text": "I saw the company's manager last day.", "Person": [{"name": "John", "indices": [0, 1]}], ...}

The one big advantage you've got over any NLP-specific formats here is that JSON can be parsed in any environment, and since you'll probably have to edit your format anyway, JSON lends itself to very simple edits that give you a short distance to other formats.
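
A minimal sketch of what working with such a file could look like in Python. The layout (entity type mapped to a list of annotations, each with half-open [start, end) character indices) is an assumption made for illustration, as is the helper name covered_text.

```python
import json

text = "I saw the company's manager last day."

# Hypothetical layout: entity type -> list of annotations, each carrying its
# attributes plus [start, end) character indices into "text".
document = {
    "text": text,
    "Person": [{"name": "John", "indices": [0, 1]}],
    "Organization": [{"name": "ACME", "indices": [10, 17]}],
    "Time": [{"value": "2014-5-29", "indices": [28, 36]}],
}

# Round-trip through JSON, as any other environment would.
serialized = json.dumps(document)
restored = json.loads(serialized)


def covered_text(doc, entity_type):
    """Return the substrings that annotations of one type point at."""
    return [
        doc["text"][ann["indices"][0]:ann["indices"][1]]
        for ann in doc[entity_type]
    ]


for entity_type in ("Person", "Organization", "Time"):
    print(entity_type, covered_text(restored, entity_type))
```

Because the text itself is stored untouched, the annotations can be changed or extended without ever re-parsing the sentence.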

You can also implicitly store tokenization information if you want:

{"text": ["I", "saw", "the", "company's", "manager", "last", "day."]}
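
If the text is stored as tokens, annotations can point at token positions instead of characters. A small sketch, assuming a hypothetical "tokens" field holding a half-open [first, last) token range:

```python
tokens = ["I", "saw", "the", "company's", "manager", "last", "day."]

# Hypothetical token-level annotations: "tokens" is a [first, last) range.
annotations = [
    {"type": "Person", "name": "John", "tokens": [0, 1]},
    {"type": "Time", "value": "2014-5-29", "tokens": [5, 7]},
]


def span_text(tokens, annotation):
    """Join the tokens covered by one annotation with single spaces."""
    first, last = annotation["tokens"]
    return " ".join(tokens[first:last])


for annotation in annotations:
    print(annotation["type"], repr(span_text(tokens, annotation)))
```

The trade-off is that an annotation can never be finer than a token: with this whitespace tokenization the Time span includes the trailing period, and "company" cannot be tagged without also covering the "'s".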

EDIT: To clarify, the mapping of metadata is pretty open, but here's an example:

{"body": "<some_text>",
 "metadata":
  {"<entity>":
    {"<attribute>": "<value>",
     "location": [<start_index>, <end_index>]
    }
  }
}

Hope that helps, let me know if you've got any more questions.


The brat annotation tool might be useful for you, as per my comment. I have tried many annotation tools and this is the best I have found. It has a nice user interface and supports a number of different annotation types. The annotations are stored in a separate .ann file, which contains each annotation as well as its location within the original document. A word of warning, though: if you ultimately want to feed the annotations into a classifier like the Stanford NER tool, you will have to do some manipulation to get the data into a format that it will accept.
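
As a starting point for that kind of manipulation, here is a rough sketch that reads text-bound ("T") annotations from brat's standoff format, where each line has an ID, then the type with start and end offsets, then the covered text, separated by tabs. The sample content is made up from the question's sentence, the function name parse_text_bound is hypothetical, and the sketch assumes contiguous spans only (discontinuous spans and other annotation kinds are not handled).

```python
text = "I saw the company's manager last day."

# Made-up standoff content for the sentence above.
ann_content = (
    "T1\tPerson 0 1\tI\n"
    "T2\tOrganization 10 17\tcompany\n"
    "T3\tTime 28 36\tlast day\n"
)


def parse_text_bound(content):
    """Parse contiguous text-bound ("T") lines from brat standoff content."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are skipped here
        ann_id, type_and_span, covered = line.split("\t")
        label, start, end = type_and_span.split(" ")
        entities.append(
            {"id": ann_id, "label": label,
             "start": int(start), "end": int(end), "text": covered}
        )
    return entities


entities = parse_text_bound(ann_content)
for entity in entities:
    # The offsets index directly into the original document text.
    print(entity["label"], repr(text[entity["start"]:entity["end"]]))
```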


In general, you don't want to use XML tags to annotate documents in this way, because annotations may overlap, whereas XML elements must nest properly.

UIMA, GATE and similar NLP frameworks store the tags separately from the text. Each tag, such as Person, ACME, John, etc., is stored as the position where the tag begins and the position where it ends. So the tag ACME, which covers the word "company", would be stored as starting at position 11 and ending at position 17 (counting characters from 1).
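
To illustrate the idea (this is not the UIMA or GATE API, just a sketch), standoff annotations can be modelled as plain records that point into the text. The labels "PhraseA" and "PhraseB" are hypothetical and only there to show two spans that cross each other, which inline XML cannot express. The code uses Python's 0-based, end-exclusive offsets, so the span for "company" becomes 10 to 17 rather than 11 to 17.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    label: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


text = "I saw the company's manager last day."

organization = Annotation("Organization", 10, 17)  # "company"
phrase_a = Annotation("PhraseA", 10, 27)           # "company's manager"
phrase_b = Annotation("PhraseB", 20, 36)           # "manager last day"


def covered(text, ann):
    """Return the substring an annotation points at."""
    return text[ann.start:ann.end]


def crosses(a, b):
    """True if the spans partially overlap, with neither containing the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


for ann in (organization, phrase_a, phrase_b):
    print(ann.label, repr(covered(text, ann)))

print(crosses(phrase_a, phrase_b))      # crossing spans
print(crosses(organization, phrase_a))  # nested spans, which XML could represent
```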
