Manual Data Cleanup Tools

I am writing an ETL pipeline for geospatial data with records of the form place_name,address,longitude,latitude,id_linking_to_other_dataset.

As the last step in the pipeline, I would like to apply manual transformations submitted by reviewers. Some of these transformations might be (borrowing from Google maps suggest edits docs):

  • Change a place's name, location, or the id linking it to another dataset
  • Mark a place private or non-existent
  • Mark a place as moved or duplicated

I don't have a ton of records (about 5,000), but I would like to manage these manual corrections using best practices. Ideally, corrections could be version controlled and applied as the last step in the ETL pipeline (even if other parts of the pipeline change).
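One common pattern for this is a version-controlled "corrections" file keyed by a stable identifier, applied as a patch step at the end of the pipeline. The sketch below assumes a hypothetical corrections.csv with columns id, field, and new_value (none of these names come from an existing tool; they are illustrative):

```python
import io

import pandas as pd

# Toy pipeline output; in practice this is the DataFrame produced
# by the earlier ETL stages.
places = pd.DataFrame({
    "id": [1, 2, 3],
    "place_name": ["Cafe A", "Cafe B", "Cafe C"],
    "longitude": [13.40, 13.41, 13.42],
    "latitude": [52.52, 52.53, 52.54],
})

# Hypothetical version-controlled corrections file: one row per
# reviewer edit. A "status" field covers marking a place private,
# non-existent, moved, or duplicated.
corrections_csv = """id,field,new_value
2,place_name,Cafe B (renamed)
3,status,non-existent
"""
corrections = pd.read_csv(io.StringIO(corrections_csv))

def apply_corrections(df: pd.DataFrame, corrections: pd.DataFrame) -> pd.DataFrame:
    """Apply reviewer edits to the pipeline output, keyed on id."""
    df = df.copy()
    df["status"] = "active"  # default status for every record
    for row in corrections.itertuples(index=False):
        # Overwrite the named field for the matching record.
        df.loc[df["id"] == row.id, row.field] = row.new_value
    return df

patched = apply_corrections(places, corrections)
```

Because the corrections file only references stable ids and field names, it survives changes elsewhere in the pipeline, and its Git history doubles as an audit log of reviewer edits.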

There are lots of good tools for handing off data annotation for ML, but I'm not seeing resources for this type of correction. Any thoughts on useful tools?



I would recommend loading it into a DataFrame and then using standard pandas functionality (str.replace, loc, iloc).

My answer is a bit vague, sorry for that, but I would need to know a bit more about the technical details of your ETL pipeline and the format of the data you want to change.
