Document understanding - sentence length prediction

The subject is taken almost verbatim from this paper: https://arxiv.org/pdf/2108.02923.pdf. One of the tasks is to tell whether two words in a document are part of the same phrase. For example, if the KEY is "Name" and the VALUE against it is "MAD MAX", can we identify MAD and MAX as belonging to the same entity? They show up as separate entities because the ROI algorithm (R-CNN and family) will more or less draw bounding boxes …
Category: Data Science

Object Detection: Unusual warning while training Detectron2 Faster R-CNN

I am trying to train a Detectron2 faster_rcnn_R_50_FPN_3x model, pretrained on the PubLayNet dataset, on a custom dataset. While training, I get the following warnings: WARNING [01/14 14:35:22 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.cls_score.weight' to the model due to incompatible shapes: (7, 1024) in the checkpoint but (6, 1024) in the model! You might want to double check if this is expected. WARNING [01/14 14:35:22 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.cls_score.bias' to the model due to incompatible shapes: (7,) in the checkpoint …
Category: Data Science
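
The warning says the checkpoint's classification layer has a different number of output rows than the model being built. Below is a minimal sketch, in plain Python with no Detectron2 dependency, of the kind of shape comparison that produces such a message. The helper names `find_shape_mismatches` and `implied_num_classes` are hypothetical, and reading the first dimension as "number of classes plus one background row" is an assumption about how the box predictor is sized.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return parameters present in both dicts whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


def implied_num_classes(cls_score_shape):
    """Assumption: the classifier has one output row per class plus one for background."""
    return cls_score_shape[0] - 1


# Shapes copied from the first warning; the bias line is truncated in the
# excerpt, so it is left out here rather than guessed.
checkpoint_shapes = {"roi_heads.box_predictor.cls_score.weight": (7, 1024)}
model_shapes = {"roi_heads.box_predictor.cls_score.weight": (6, 1024)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (ckpt_shape, model_shape) in mismatches.items():
    print(name, "checkpoint:", ckpt_shape, "model:", model_shape)
    print("classes implied by checkpoint:", implied_num_classes(ckpt_shape))
    print("classes implied by model:", implied_num_classes(model_shape))
```

Under that assumption, the checkpoint was trained with one more class than the custom model is configured for. If so, skipping the classification head while loading the remaining weights would be expected behaviour when fine-tuning on a dataset with a different label set.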

Entity Linking for Receipts

I am building a model for reading receipts from their mobile snapshots. After the receipt is OCR'd, I plan to use a variation on LayoutLM for entity extraction. Entities are: "quantity", "price-per-unit", "product-name", "items-price", etc. What is the best model to consider to link all these entities into a single receipt item, so the final result looks like: "items": [ {"product": ..., "unit_price": ..., "price_paid": ..., "quantity": ..., }, ... ]
Category: Data Science
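
As a point of comparison for any learned linking model, here is a rough baseline sketch that groups already-labelled entities into receipt items purely by vertical position. It is a heuristic, not a recommendation of the best model. The input format (dicts with `label`, `text` and a vertical centre `y` in pixels), the function name `group_entities_into_items` and the `row_tolerance` value are all assumptions made for illustration.

```python
def group_entities_into_items(entities, row_tolerance=10):
    """Group labelled OCR entities into receipt items by vertical position.

    A new item starts when an entity is more than `row_tolerance` pixels
    below the first entity of the current item, or when its label is
    already present in the current item.
    """
    items = []
    current = {}
    row_y = None
    for ent in sorted(entities, key=lambda e: e["y"]):
        new_row = row_y is not None and abs(ent["y"] - row_y) > row_tolerance
        duplicate = ent["label"] in current
        if current and (new_row or duplicate):
            items.append(current)
            current = {}
        if not current:
            row_y = ent["y"]  # remember where this item starts
        current[ent["label"]] = ent["text"]
    if current:
        items.append(current)
    return items


# Hypothetical entities, as an extraction model might emit them.
entities = [
    {"label": "product", "text": "Milk", "y": 100},
    {"label": "quantity", "text": "2", "y": 102},
    {"label": "unit_price", "text": "1.50", "y": 101},
    {"label": "price_paid", "text": "3.00", "y": 99},
    {"label": "product", "text": "Bread", "y": 140},
    {"label": "price_paid", "text": "2.20", "y": 141},
]

receipt = {"items": group_entities_into_items(entities)}
print(receipt)
```

A baseline like this breaks down on skewed photos and on items that wrap across several lines, which is where a model that scores pairs of entities would be expected to help.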

What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?

I've tried reading the other answers on this topic but I'm unsure if I understand completely. For my dataset, I have a series of tagged documents, "good" or "bad." Each document belongs to an entity, and each entity has a different number of documents. Eventually, I'd like to create a classifier to detect whether or not an entity's document is good or bad and to also see what sentences are most similar to the good/bad tag. All that being said, …
Category: Data Science

"Object" Detection in Textual Data

I have a task where the input is a parsed document (i.e., the full text as one string or as tokens) and I need to classify parts of the text into, say, 5 classes (i.e., 5 tokens from the entire text are each labeled with one of 5 different classes). Example: Document #1: "... cat ..." (the token "cat" belongs to class "0", which is animals). Document #2: "... fish ..." (the token "fish" belongs to class "0", which is animals). It is important to …
Category: Data Science

Document clustering to merge common labels

I am building a recommendation system and I have to clean up some of the labels that I have. For example, df['resolution_modified'].value_counts() gives:
                                                                                                  105829
    It is recommended to replace scanner                                                            1732
    It is recommended to reboot station                                                             1483
    It is recommended to replace printer                                                             881
    It is recommended to replace keyboard                                                            700
    ...
    It is recommended to update both computers in erc to ensure y be compliant with acme               1
    It is recommended to configure and i …
Category: Data Science
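
One simple way to start merging near-duplicate labels, before reaching for a clustering model, is to map each rare label onto a more frequent label that it closely resembles. The sketch below uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure. The function name `merge_similar_labels`, the 0.95 threshold and the third label (a punctuation variant added for illustration) are assumptions, not part of the original data.

```python
from difflib import SequenceMatcher


def merge_similar_labels(counts, threshold=0.95):
    """Greedily map each label to the most frequent label it closely resembles."""
    representatives = []  # labels kept as-is, most frequent first
    mapping = {}
    for label, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        for rep in representatives:
            ratio = SequenceMatcher(None, label.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                mapping[label] = rep
                break
        else:
            # No close match among more frequent labels: keep this one.
            representatives.append(label)
            mapping[label] = label
    return mapping


counts = {
    "It is recommended to replace scanner": 1732,
    "It is recommended to reboot station": 1483,
    "It is recommended to replace scanner.": 3,  # hypothetical variant
}

mapping = merge_similar_labels(counts)
for label, target in mapping.items():
    print(repr(label), "->", repr(target))
```

With pandas, a mapping like this could be applied through `df['resolution_modified'].map(mapping)`. Character-level similarity will not merge labels that are paraphrases of each other, so labels that share a long common prefix but differ in the final noun need a threshold chosen with care.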

Identify Resume Structure

I am trying to build a resume parser (from PDF to JSON). After extracting the text from a PDF as one long string, how would you split the string into different sections, as the red lines show? Resumes have different formats and people use different labels for these sections. Is there any machine learning technique that I could look into? Thanks!
Category: Data Science
