labelling

How to label legit users when developing a bot-flagging classification model?

Marc

June 2, 2022 14:07

I’m working on a project where I try to tell bots apart from legit users on social media. The data I collected is not labeled, but I have labeled about 17% of it (22k users) through different techniques. Finding bots was easy as they all share similarities with each other, but it's different for legit users. In my labeled data, I have most if not all bots labeled but still have a ton of legit users to label which is really …

Topic: labelling labels python machine-learning

Category: Data Science

too much data to label

Marc

April 1, 2022 17:17

I'm working on a Data Science project to flag bots on Instagram. I collected a lot of data (+80k users) and now I have to label them as bot/legit users. I already flagged 20k users with different techniques, but now I feel like I'm gonna have to flag them one by one, which will likely take months. Can I just stop and be like "I'm fine with what I have" or is this bad practice? Stopping now would also mean …

Topic: labelling data machine-learning

Category: Data Science

Labelling a dataset for sentiment analysis, which model is the best?

Dan K

March 17, 2022 07:15

I want to do some sentiment analysis on a large text dataset I scraped. From what I've learned so far, I know that I need to either manually label each text (positive, negative, neutral) or use a pre-trained model like BERT or TextBlob. I want to know which model has the best accuracy in sentiment labelling. Both bi-polar (positive, negative) and tri-polar (positive, neutral, negative) are ok for the analysis I want to do. If I want to make my …

Topic: labelling sentiment-analysis classification

Category: Data Science

Why is label encoding before the split data leakage?

Anar

March 2, 2022 07:39

I want to ask why label encoding before the train/test split is considered data leakage. From my point of view, it is not, because, for example, you encode "good" to 2, "neutral" to 1 and "bad" to 0. It will be the same for both train and test sets. So why do we have to split first and then do label encoding?

Topic: test labelling data-leakage training preprocessing

Category: Data Science
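
The reasoning in the label-encoding question above can be checked with a small sketch. It assumes the mapping is a fixed, hand-written dictionary (the `ORDER` dict, the `encode` and `split` helpers, and the toy labels are all hypothetical, not from the question), and it compares encoding before the split with encoding after it.

```python
import random

# Fixed, hand-written mapping taken from the question's example.
# Nothing here is learned from the data (assumption of this sketch).
ORDER = {"bad": 0, "neutral": 1, "good": 2}

def encode(labels):
    """Map each string label to its fixed integer code."""
    return [ORDER[label] for label in labels]

def split(items, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and return (train, test) lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(len(items) * (1 - test_fraction))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test

# Hypothetical toy labels.
labels = ["good", "bad", "neutral", "good", "bad", "neutral", "good", "bad"]

# Path A: encode everything first, then split.
train_a, test_a = split(encode(labels))

# Path B: split first, then encode each part with the same fixed mapping.
train_raw, test_raw = split(labels)
train_b, test_b = encode(train_raw), encode(test_raw)

# Both paths use the same seed, so the same rows land in each part.
same_result = (train_a == train_b) and (test_a == test_b)
print(same_result)
```

The equivalence relies on the mapping being fixed in advance. An encoder that derives its categories from the data it is fitted on is a different situation and is not covered by this sketch.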

Unsupervised clustering for Text labelling

Timoth Dev A

January 25, 2022 13:46

I have millions of semi-structured text descriptions of job requirements, which need to be labelled with fields such as the number of hours, years of experience required, shifts, certifications, licensure, etc. I need to segregate them and put them in a structured format, and was wondering if I can use some unsupervised labelling methods. P.S.: The details are not structured enough to use regular expressions.

Topic: labelling text-mining nlp

Category: Data Science

Is Hough transform an appropriate line detector for my problem?

Bohniti

November 24, 2021 15:37

I am trying to get automated labels for images with the help of computer vision. Problem: The labels are papyrus fibers on the outer edges of a papyrus fragment. After some research (for example [3]), I came up with the following pipeline: 1) Use binarization (Otsu). 2) Calculate the skeleton. 3) Use the Hough transform? Image I shows what happens after I put the result from step 2 into the original image. In Image II, you can see what I want to have after …

Topic: labelling computer-vision feature-extraction

Category: Data Science
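
The three-step pipeline named in the Hough transform question above (Otsu binarization, skeleton, Hough transform) could be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-image, assumes the fibers are brighter than the background, and runs on a synthetic image. The function name `detect_fiber_lines` and all parameter values are hypothetical and not taken from the question.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize
from skimage.transform import probabilistic_hough_line

def detect_fiber_lines(gray, min_length=20, max_gap=3):
    """Return line segments found on the skeleton of a grayscale image.

    Assumes bright fibers on a dark background (invert the image otherwise).
    """
    # Step 1: Otsu binarization.
    binary = gray > threshold_otsu(gray)
    # Step 2: reduce each blob to a one-pixel-wide skeleton.
    skeleton = skeletonize(binary)
    # Step 3: probabilistic Hough transform on the skeleton.
    # Each result is ((x0, y0), (x1, y1)).
    return probabilistic_hough_line(
        skeleton, threshold=10, line_length=min_length, line_gap=max_gap
    )

# Synthetic stand-in for a fragment: one thick horizontal "fiber".
image = np.zeros((100, 100), dtype=float)
image[48:53, 10:90] = 1.0

lines = detect_fiber_lines(image)
print(len(lines))
```

The thresholds here are placeholders. On real scans they would need tuning, and whether the Hough transform suits curved fibers is exactly what the question leaves open.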

Label A records B times or label A*B records

dasjanik

November 1, 2021 03:06

This question concerns pre-training data sourcing. Suppose you have a human workforce of B individuals and a potentially unlimited source of data. The task is labeling images with classes. These classes are somewhat subjective (emotions). This means one individual might label the same image with a different class than another individual. To then use these labeled records as training data for a neural network that predicts classes on images, is it better to 1) have a number of records (A) …

Topic: labelling training image-classification neural-network

Category: Data Science
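
When the same record is labeled by several individuals, as in the question above, the B labels per record have to be combined somehow before training. A minimal sketch, assuming simple majority voting (one possible choice, not something the question prescribes); the `majority_label` helper and the annotation data are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label and the share of annotators who chose it.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical annotations: image id -> labels from B = 3 annotators.
annotations = {
    "img_001": ["happy", "happy", "surprised"],
    "img_002": ["sad", "sad", "sad"],
}

aggregated = {
    image_id: majority_label(votes) for image_id, votes in annotations.items()
}
print(aggregated)
```

The agreement share gives a rough per-record signal of how subjective a label was, which is one reason someone might prefer repeated labeling over labeling more records once each.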

CRFSuite/Wapiti: How to create intermediary data for running a training?

Sixtyfive

October 14, 2021 07:57

After having asked for and been suggested two pieces of software last week (for training a model to categorize chunks of a string) I'm now struggling to make use of either one of them. It seems that in machine learning (or at least, with CRF?), you can't just train on the training data directly, but you have to go through an intermediary step first.¹ From the CRFsuite tutorial: The next step is to preprocess the training and testing data to …

Topic: labelling nlp

Category: Data Science

Is an image with 5 labels equivalent to 5 images with 1 label?

Joshua Wilkinson

October 13, 2021 15:03

I am collecting data to train an object detection model using and was wondering if 5 labels in the same image and 5 images with 1 label each provided the same quality of input training data. Example: an image with 5 labeled apples vs. 5 images with 1 apple each.

Topic: object-detection labelling machine-learning

Category: Data Science

Software/Library Suggestion: Is there a usable open-source sequence tagger around?

Sixtyfive

September 30, 2021 23:06

(Not sure if this is the right community for the question - please do downvote if stats. or whatever else is more appropriate...) I'm looking for a suggestion for either a command-line tool or library (preferably Python or Ruby, but at this point, anything will do) implementing non-Parts-of-Speech-specific sequence tagging/labelling. If it was PoS-specific but could be re-trained for custom categories, that'd be fine, too. The projects I've found mostly seem to be abandoned PhD thesis codebases or similar and …

Topic: labelling nlp

Category: Data Science

Looking for 2D Point Cloud Labeling tool

Simon Krannig

July 16, 2021 07:49

To understand my high-dimensional dataset, I am using t-SNE to "project" the data onto two dimensions. As is the nature of t-SNE, this will be an experimental, iterative process. I want to be able to keep track of datapoints' movements over many iterations. For that I am searching for a tool to label the resulting two-dimensional point clouds directly in a plot (by drawing, creating polygons, etc.). Does anybody know of tools that can help with that?

Topic: labelling tsne python

Category: Data Science

How should I construct a binary classifier for a small set of positive data and millions of unlabeled data?

Deli

May 29, 2021 14:37

Does anyone have suggestions for a specific algorithm or implementation for labeled data of only one class and unlabeled data that can be from either class? I'm also unsure what the proportion of Class A to B is within the unlabeled data, and my labeled data is not randomly chosen.

Topic: labelling classification machine-learning

Category: Data Science

Solutions for Labelling Training Data for Binary Classification Problems

ha9u63ar

April 8, 2021 10:03

I have a huge dataset for which I am trying to use an 80-20 (Holdout method) approach to train and test my model. However, the dataset I have been given has 6m rows. The objective is to train+test+validate the model before using live data traffic for real-time predictions. The expected result here is "It's not corrupted with 97% accuracy" which is implementation details and output of some Jupyter notebook etc. My question is: are there any alternatives to manually …

Topic: labelling semi-supervised-learning classification

Category: Data Science

Python package for machine-learning aided data labelling

Pieter

February 10, 2021 16:47

In a lot of cases unlabelled data needs to be transformed to labelled data. The best solution is to use (multiple) human classifiers. However, going through all the data by hand (e.g. in text mining or image processing) is often a daunting task. Is there software that can combine human classifiers and machine-learning techniques in real time? I am especially interested in Python packages. To illustrate, classifying images from video streams is very repetitive. After 100 images (from different streams) a machine-learning …

Topic: labelling labels active-learning python machine-learning

Category: Data Science

(Labeled, if possible) time-series datasets for anomaly detection

Guillermo Mosse

January 21, 2021 08:43

I would like to create a big list of available time-series datasets for anomaly detection. I'm especially interested in the following: the time-series data should be segmented into cycles; ideally, these cycles should be of the same length; and these cycles should be labeled as normal/anomalous. But anything goes. I will be sharing the ones I found below.

Topic: labelling labels anomaly-detection time-series dataset

Category: Data Science

Sampling items from a population of subpopulations

dzieciou

January 16, 2021 14:50

I have a population of $n$ items to label and a budget to label only $m$ ($m \ll n$) of them before training. The population can be partitioned into subpopulations, recursively. In other words, the whole population can be represented as a tree of subpopulations: $x_1$ can be split into subpopulations $x_2$ and $x_7$, $x_2$ into $x_3$ and $x_4$, etc. Some subpopulations are more diverse and have more subpopulations. What algorithm should I use to sample $m$ items, so that …

Topic: labelling sampling

Category: Data Science

Best practices for image annotation for object detection when objects overlap

Emanuel Huber

January 14, 2021 18:16

If I have the following example: How should I annotate the bottom image? I can think of these scenarios: 1) Create a large box that captures class B and a second box that entirely captures class A. This will lead to overlapping boxes. 2) Create two boxes that do not overlap, but also do not cover the objects entirely. I think the right choice is scenario 1, but will the algorithm be able to capture this difference? Considering SSD, YOLO.

Topic: object-detection labelling

Category: Data Science

How do I label images faster?

Vendetta

October 23, 2020 16:01

I have around 1600 images extracted from videos shot at night time. I am labeling each image and trying to be as accurate as I can in assigning bounding boxes. I am labeling vehicles and traffic lights/traffic signs. This is very time-consuming, so I am wondering if someone has done this before and can advise me on some automated methods of labeling night-time images. The objects of interest usually appear quite small. An example labelled …

Topic: labelling computer-vision

Category: Data Science

Online Audio annotation tools

Aidos

June 1, 2020 06:31

I need to find a decent online annotation tool to transcribe audio. There are some requirements for a potential tool: I should be able to deliver audio files to a few labelers; I should be able to track which files went to which labeler; and it should be safe in terms of data storage. Any suggestions?

Topic: annotation labelling

Category: Data Science

How to pass more than 2 input columns to a Deep learning Keras model for sequence tagging/labeling

Sneha.Priya

April 22, 2020 13:28

I have to build a neural network which extracts the relationship between two entities. The input should be: the input text/paragraph, a vocabulary of entities, and the relationship phrases that the system should recognize. The output is a sequence of tags, and the output sequence has the same length as the input text/paragraph. The dataset is a CSV file with 3 input columns (input text, entities in text, relationship between 2 entities) and 1 output column. I am using the Keras library to build this model. Example-input1: zomato acquires uber; input2: zomato, uber; input3: …

Topic: labelling keras deep-learning nlp python

Category: Data Science
