How to label legit users when developing a bot-flagging classification model?

I’m working on a project where I try to tell bots apart from legit users on social media. The data I collected is not labeled, but I have labeled about 17% of it (22k users) through different techniques. Finding bots was easy, as they all share similarities with each other, but it's different for legit users. In my labeled data, I have most if not all bots labeled but still have a ton of legit users to label which is really …
Category: Data Science

Too much data to label

I'm working on a Data Science project to flag bots on Instagram. I collected a lot of data (80k+ users) and now I have to label them as bot/legit users. I already flagged 20k users with different techniques, but now I feel like I'm going to have to flag the rest one by one, which will likely take months. Can I just stop and be like "I'm fine with what I have" or is this bad practice? Stopping now would also mean …
Category: Data Science

Labelling a dataset for sentiment analysis, which model is the best?

I want to do some sentiment analysis on a large text dataset I scraped. From what I've learned so far, I know that I need to either manually label each text (positive, negative, neutral) or use a pre-trained model or library such as BERT or TextBlob. I want to know which model has the best accuracy in sentiment labelling. Both two-class (positive, negative) and three-class (positive, neutral, negative) labelling are OK for the analysis I want to do. If I want to make my …
Category: Data Science

Unsupervised clustering for Text labelling

I have millions of semi-structured text descriptions of job requirements. They need to be labelled with fields such as the number of hours, years of experience required, shifts, certifications, licensure, etc. I need to segregate these details and put them in a structured format, and I was wondering if I can use some unsupervised labelling methods. P.S.: The details are not structured enough to use regular expressions.
Category: Data Science

Is Hough transform an appropriate line detector for my problem?

I am trying to get automated labels for images with the help of computer vision. Problem: The labels are papyrus fibers on the outer edges of a papyrus fragment. After some research (for example [3]), I came up with the following pipeline: (1) Use binarization (Otsu). (2) Calculate the skeleton. (3) Use the Hough transform? Image I shows what happens after I put the result from step 2 into the original image. In Image II, you can see what I want to have after …
Category: Data Science
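
For readers unfamiliar with step 1 of the pipeline in the excerpt above, here is a minimal pure-Python sketch of Otsu's threshold selection. It is only an illustration of the technique, not the asker's code: the function name `otsu_threshold` and the toy pixel list are made up for this example, and an image library would normally be used instead. The skeleton and Hough steps are not covered.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # intensity sum of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Hypothetical toy "image": a dark background and a bright foreground.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
binary = [p > t for p in pixels]  # True marks foreground pixels
```

On real fragments the histogram is rarely this clean, so the binarized result would still need checking before skeletonization.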

Label A records B times or label A*B records

This question concerns pre-training data sourcing. Suppose you have a human workforce of B individuals and a potentially unlimited source of data. The task is labeling images with classes. These classes are somewhat subjective (emotions). This means one individual might label the same image with a different class than another individual. If these labeled records are then used as training data for a neural network that predicts classes on images, is it better to 1) have a number of records (A) …
Category: Data Science
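
If each record is labeled several times, the labels have to be combined somehow before training. A common baseline is a majority vote with an agreement score. The sketch below assumes that setup; the record ids, the labels, and the helper name `majority_vote` are hypothetical, and it does not answer which of the two options in the question is better.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, agreement) for one record's annotator labels.

    On a tie, the label that was seen first among the tied ones wins.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Hypothetical data: three annotators labeled each image with an emotion.
annotations = {
    "img_001": ["happy", "happy", "surprised"],
    "img_002": ["sad", "sad", "sad"],
}

aggregated = {rec: majority_vote(labs) for rec, labs in annotations.items()}
```

The agreement score can be used to drop or down-weight records on which annotators disagree, which is one reason to label the same record more than once.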

CRFSuite/Wapiti: How to create intermediary data for running a training?

After having asked for and been suggested two pieces of software last week (for training a model to categorize chunks of a string) I'm now struggling to make use of either one of them. It seems that in machine learning (or at least, with CRF?), you can't just train on the training data directly, but you have to go through an intermediary step first.¹ From the CRFsuite tutorial: The next step is to preprocess the training and testing data to …
Topic: labelling nlp
Category: Data Science

Software/Library Suggestion: Is there a usable open-source sequence tagger around?

(Not sure if this is the right community for the question - please do downvote if stats. or whatever else is more appropriate...) I'm looking for a suggestion for either a command-line tool or library (preferably Python or Ruby, but at this point, anything will do) implementing non-Parts-of-Speech-specific sequence tagging/labelling. If it was PoS-specific but could be re-trained for custom categories, that'd be fine, too. The projects I've found mostly seem to be abandoned PhD thesis codebases or similar and …
Topic: labelling nlp
Category: Data Science

Looking for 2D Point Cloud Labeling tool

To understand my high-dimensional dataset, I am using t-SNE to "project" the data onto two dimensions. As is the nature of t-SNE, this will be an experimental, iterative process. I want to be able to keep track of the data points' movements over many iterations. For that, I am searching for a tool to label the resulting two-dimensional point clouds directly in a plot (by drawing, creating polygons, etc.). Does anybody know of tools that can help with that?
Category: Data Science

How should I construct a binary classifier for a small set of positive data and millions of unlabeled data?

Does anyone have suggestions for a specific algorithm or implementation for labeled data of only one class and unlabeled data that can be from either class? I'm also unsure what proportion of Class A to Class B exists within the unlabeled data, and my labeled data is not randomly chosen.
Category: Data Science

Solutions for Labelling Training Data for Binary Classification Problems

I have a huge dataset for which I am trying to use an 80-20 (holdout method) approach to train and test my model. However, the dataset I have been given has 6m rows. The objective is to train, test, and validate the model before using live data traffic for real-time predictions. The expected result here is "It's not corrupted with 97% accuracy", which comes down to implementation details and the output of some Jupyter notebook, etc. My question is: are there any alternatives to manually …
Category: Data Science
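
For reference, the 80-20 holdout split mentioned in the excerpt can be sketched in a few lines of standard-library Python. The helper name `holdout_split`, the seed, and the toy rows are assumptions made for this illustration; it says nothing about how the rows get their labels.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the real rows.
rows = list(range(1000))
train, test = holdout_split(rows)
```

A separate validation set can be carved out the same way by calling the function again on `train`.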

Python package for machine-learning aided data labelling

In a lot of cases unlabelled data needs to be transformed to labelled data. The best solution is to use (multiple) human classifiers. However, going through all the data by hand (e.g. in text-mining or image-processing) is often a daunting task. Is there software that can combine human classifiers and machine-learning techniques in real time? I am especially interested in Python packages. To illustrate, classifying images from video streams is very repetitive. After 100 images (from different streams) a machine-learning …
Category: Data Science

(Labeled, if possible) time-series datasets for anomaly detection

I would like to create a big list of available time-series datasets for anomaly detection. I'm especially interested in the following: the time-series data should be segmented into cycles; ideally, these cycles should be of the same length; these cycles should be labeled as normal/anomalous. But anything goes. I will be sharing the ones I found below.
Category: Data Science

Sampling items from a population of subpopulations

I have a population of $n$ items to label and a budget to label only $m$ ($m \ll n$) of them before training. The population can be partitioned into subpopulations, recursively. In other words, the whole population can be represented as a tree of subpopulations: $x_1$ can be split into subpopulations $x_2$ and $x_7$, $x_2$ into $x_3$ and $x_4$, etc. Some subpopulations are more diverse and have more subpopulations. What algorithm should I use to sample $m$ items, so that …
Category: Data Science
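
The excerpt is cut off before it states the sampling objective, so the following is only one possible reading: assuming the goal is even coverage of the subpopulation tree, the budget can be split equally among the children of each node and sampled at random at the leaves. The tree layout, the node names, and the function `sample_tree` are hypothetical.

```python
import random


def sample_tree(node, budget, rng):
    """Sample up to `budget` items, splitting the budget evenly among children.

    A node is either a list of items (leaf) or a dict of child nodes.
    Budget that a small leaf cannot use is not redistributed.
    """
    if isinstance(node, list):
        return rng.sample(node, min(budget, len(node)))
    children = list(node.values())
    base, extra = divmod(budget, len(children))
    picked = []
    for i, child in enumerate(children):
        share = base + (1 if i < extra else 0)  # first children get the remainder
        picked.extend(sample_tree(child, share, rng))
    return picked


# Hypothetical tree, loosely following the x_1 ... x_7 naming in the question.
tree = {
    "x2": {"x3": list(range(0, 50)), "x4": list(range(50, 60))},
    "x7": list(range(60, 100)),
}
sample = sample_tree(tree, 10, random.Random(0))
```

Because the split ignores subpopulation size, small but distinct branches are over-represented relative to uniform sampling, which may or may not be what the asker wants.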

Best practices to image annotation for object detection when objects overlap

Given the following example, how should I annotate the bottom image? I can think of these scenarios: (1) Create a large box that captures class B and a second box that entirely captures class A. This will lead to overlapping boxes. (2) Create two boxes that do not overlap, but also do not cover the objects entirely. I think the right choice is scenario 1, but will the algorithm be able to capture this difference, considering SSD or YOLO?
Category: Data Science

How do I label images faster?

I have around 1600 images extracted from videos shot at night. I am labeling each image and trying to be as accurate as I can in assigning bounding boxes. I am labeling vehicles and traffic lights/traffic signs. This is very time-consuming, and I am wondering if someone has experience with this or has done it before and can advise me on some automated methods of labeling night-time images. The objects of interest usually appear to be quite small. An example labelled …
Category: Data Science

Online Audio annotation tools

I need to find a decent online annotation tool to transcribe audio. There are some requirements for a potential tool: I should be able to deliver audio files to a few labelers; I should be able to track which files went to which labeler; and it should be safe in terms of data storage. Any suggestions?
Category: Data Science

How to pass more than 2 input columns to a Deep learning Keras model for sequence tagging/labeling

I have to build a neural network which extracts the relationship between two entities. The input should be: the input text/paragraph, a vocabulary of entities, and the relationship phrases that the system should recognize. The output is a sequence of tags, and the output sequence has the same length as the input text/paragraph. The dataset is a CSV file with 3 input columns (input text, entities in text, relationship between 2 entities) and 1 output column. I am using the Keras library to build this model. Example-input1: zomato acquires uber; input2: zomato, uber; input3: …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.