How Can I Process SageMaker Ground Truth NER JSON Output into DataFrame?
So, I've recently created a job using AWS SageMaker Ground Truth for NER purposes, and have received an output in the form a manifest file. I'm now trying to process the manifest file into a dataframe, and I'm failing greatly.
The JSON file is incredibly complex. Here's an example of it based on the documentation:
{
source: Amazon SageMaker is a cloud machine-learning platform that was launched in November 2017. SageMaker enables developers to create, train, and deploy machine-learning (ML) models in the cloud. SageMaker also enables developers to deploy ML models on embedded systems and edge-devices,
ner-labeling-job-attribute-name: {
annotations: {
labels: [
{
label: Date,
shortDisplayName: dt
},
{
label: Verb,
shortDisplayName: vb
},
{
label: Thing,
shortDisplayName: tng
},
{
label: People,
shortDisplayName: ppl
}
],
entities: [
{
label: Thing,
startOffset: 22,
endOffset: 53
},
{
label: Thing,
startOffset: 269,
endOffset: 281
},
{
label: Verb,
startOffset: 63,
endOffset: 71
},
{
label: Verb,
startOffset: 228,
endOffset: 234
},
{
label: Date,
startOffset: 75,
endOffset: 88
},
{
label: People,
startOffset: 108,
endOffset: 118
},
{
label: People,
startOffset: 214,
endOffset: 224
}
]
}
},
ner-labeling-job-attribute-name-metadata: {
job-name: labeling-job/example-ner-labeling-job,
type: groundtruth/text-span,
creation-date: 2020-10-29T00:40:39.398470,
human-annotated: yes,
entities: [
{
confidence: 0
},
{
confidence: 0
},
{
confidence: 0
},
{
confidence: 0
},
{
confidence: 0
},
{
confidence: 0
},
{
confidence: 0
}
]
}
}
So far, I've only been able to extract the source
and the entities
, but now the dataframe has a list of dictionaries on its second column.
How should I process the JSON file into a DataFrame using Pandas
? Or is there a better way to process this output?
Many thanks in advance.
Topic sagemaker feature-engineering named-entity-recognition aws pandas
Category Data Science