How Can I Process SageMaker Ground Truth NER JSON Output into DataFrame?

Question

How Can I Process SageMaker Ground Truth NER JSON Output into DataFrame?

Timothy Hartanto

2021年12月21日 09:48

So, I've recently created a job using AWS SageMaker Ground Truth for NER purposes, and have received an output in the form a manifest file. I'm now trying to process the manifest file into a dataframe, and I'm failing greatly.

The JSON file is incredibly complex. Here's an example of it based on the documentation:

{
    source: Amazon SageMaker is a cloud machine-learning platform that was launched in November 2017. SageMaker enables developers to create, train, and deploy machine-learning (ML) models in the cloud. SageMaker also enables developers to deploy ML models on embedded systems and edge-devices,
    ner-labeling-job-attribute-name: {
        annotations: {
            labels: [
                {
                    label: Date,
                    shortDisplayName: dt
                },
                {
                    label: Verb,
                    shortDisplayName: vb
                },
                {
                    label: Thing,
                    shortDisplayName: tng
                },
                {
                    label: People,
                    shortDisplayName: ppl
                }
            ],
            entities: [
                {
                    label: Thing,
                    startOffset: 22,
                    endOffset: 53
                },
                {
                    label: Thing,
                    startOffset: 269,
                    endOffset: 281
                },
                {
                    label: Verb,
                    startOffset: 63,
                    endOffset: 71
                },
                {
                    label: Verb,
                    startOffset: 228,
                    endOffset: 234
                },
                {
                    label: Date,
                    startOffset: 75,
                    endOffset: 88
                },
                {
                    label: People,
                    startOffset: 108,
                    endOffset: 118
                },
                {
                    label: People,
                    startOffset: 214,
                    endOffset: 224
                }
            ]
        }
    },
    ner-labeling-job-attribute-name-metadata: {
        job-name: labeling-job/example-ner-labeling-job,
        type: groundtruth/text-span,
        creation-date: 2020-10-29T00:40:39.398470,
        human-annotated: yes,
        entities: [
            {
                confidence: 0
            },
            {
                confidence: 0
            },
            {
                confidence: 0
            },
            {
                confidence: 0
            },
            {
                confidence: 0
            },
            {
                confidence: 0
            },
            {
                confidence: 0
            }
        ]
    }
}

So far, I've only been able to extract the source and the entities, but now the dataframe has a list of dictionaries on its second column.

How should I process the JSON file into a DataFrame using Pandas? Or is there a better way to process this output?

Many thanks in advance.

Edit: Here's what I'm hoping to see

Topic sagemaker feature-engineering named-entity-recognition aws pandas

Category Data Science

Oxbowerce · Accepted Answer · 2021年12月16日 10:06

You can use base python syntax to transform the dictionary into a format that would work for pandas before converting it to a dataframe:

from itertools import groupby
import pandas as pd

entities = data["ner-labeling-job-attribute-name"]["annotations"]["entities"]
entities = [
    (entity["label"], data["source"][entity["startOffset"]:entity["endOffset"]])
    for entity in entities
]
entities = {
    key: [[x[1] for x in data]]
    for key, data in groupby(entities, lambda x: x[0])
}
df = pd.DataFrame({"Source": data["source"], **entities}).to_markdown(index=False)

Source	Thing	Verb	Date	People
Amazon SageMaker is a cloud machine-learning platform that was launched in November 2017. SageMaker enables developers to create, train, and deploy machine-learning (ML) models in the cloud. SageMaker also enables developers to deploy ML models on embedded systems and edge-devices	['cloud machine-learning platform', 'edge-devices']	['launched', 'deploy']	['November 2017']	['developers', 'developers']

How Can I Process SageMaker Ground Truth NER JSON Output into DataFrame?

About