Which model is used for document extraction (CamScanner, Microsoft Lens, etc.)?

I want to start a small project where I'd create one or more models that extract a document from a picture and rescale it, similar to what the CamScanner or Microsoft Lens apps do.

I've gathered a small dataset just to prototype the concept, but I'm not sure what might be the best approach to label the data.

  1. Bounding boxes: this might work best for locating the document, but the box would also include background noise, since the picture might be taken at an angle or the document might be held in a hand, so further processing would be needed to remove the background.
  2. Mask R-CNN: it would probably do a good job of isolating the document, but I guess it would be tricky to reshape/center the result later, since the mask can be irregularly shaped (for example, if someone is holding the document, the finger covering it might get excluded from the mask, so some extrapolation would probably be needed).
  3. Keypoints, as in pose estimation models: the keypoints would be the corners of the document, connected by straight lines to isolate the document, which could then be re-centered.
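
To make the third option more concrete, here is a minimal sketch of what could happen after a model has predicted four corner keypoints: order the corners, pick an output size, and solve for the perspective transform (homography) that maps the document onto an upright rectangle. It uses only NumPy. The function names (`order_corners`, `homography`, `apply_h`, `rectify_params`) and the sample coordinates are hypothetical, chosen for illustration, and the corner-ordering heuristic assumes the document is not rotated by much more than about 45 degrees. This is not a claim about how CamScanner or Microsoft Lens work internally.

```python
import numpy as np

def order_corners(pts):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Heuristic: assumes the document is roughly upright in the photo.
    """
    pts = np.asarray(pts, dtype=float)
    s = pts.sum(axis=1)        # x + y: smallest at top-left, largest at bottom-right
    d = pts[:, 0] - pts[:, 1]  # x - y: largest at top-right, smallest at bottom-left
    return np.array([pts[s.argmin()], pts[d.argmax()], pts[s.argmax()], pts[d.argmin()]])

def homography(src, dst):
    """Solve for the 3x3 matrix H that maps 4 src points onto 4 dst points (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The solution is the null vector of the 8x9 system: last row of V^T.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply homography H to an array of (x, y) points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def rectify_params(corners):
    """Return (H, (width, height)) for flattening the quadrilateral `corners`."""
    tl, tr, br, bl = order_corners(corners)
    # Use the longer of each pair of opposite sides as the output size.
    width = int(round(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl))))
    height = int(round(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr))))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=float)
    return homography(np.array([tl, tr, br, bl]), dst), (width, height)

# Hypothetical keypoints predicted by a model, in arbitrary order.
corners = [(420, 80), (60, 120), (30, 560), (470, 600)]
H, size = rectify_params(corners)
mapped = apply_h(H, order_corners(corners))  # corners land on the output rectangle
```

If OpenCV is available, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` perform the same estimation and the actual image resampling. The same post-processing could also follow the mask approach, provided the mask is first reduced to a four-point polygon.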

Has anyone worked on this type of problem, or does anyone have an idea how the apps mentioned above handle it? There are probably other approaches that I'm unaware of.

