It seems like a lot of noteworthy AI tools are trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora (Facebook Translate, GPT-3). In general, an automatic, universal way of generating a dataset seems preferable. Is there any widely used web crawler that does essentially the same thing as Common Crawl but takes a parameter for the language sought? In other words, can one generate a web-crawled dataset in language X? (Background: I’d like to create …
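To my knowledge there is no crawler that takes "language sought" as a parameter, but Common Crawl annotates detected content languages in its index, and you can also filter the crawl text yourself. A minimal sketch of the do-it-yourself route, assuming warcio and langdetect are installed and a WET segment has been downloaded locally (the file name is a placeholder):

```python
# Keep only the WET (plain-text) records detected as the target language.
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

TARGET_LANG = "de"  # "language X" as an ISO 639-1 code

def records_in_language(wet_path, lang=TARGET_LANG):
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            try:
                if detect(text[:2000]) == lang:  # classify a prefix for speed
                    yield record.rec_headers.get_header("WARC-Target-URI"), text
            except LangDetectException:
                continue  # record too short or odd to classify

for url, text in records_in_language("CC-MAIN-example.warc.wet.gz"):
    print(url, len(text))
```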
I am trying to build a scraper that will run continuously and save tweets from a list of users instantaneously, or within seconds of a user posting. It could append the tweet details to a continuously updated CSV file.
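A minimal sketch of one way to get near-real-time tweets, using Tweepy's streaming interface (the v3.x StreamListener API; Tweepy 4.x reworked this into StreamingClient). The credentials and user IDs are placeholders; each incoming tweet is appended to the CSV as it arrives:

```python
import csv
import tweepy

USER_IDS = ["783214", "17874544"]  # numeric IDs of the accounts to follow (placeholders)

class CsvStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append one row per tweet so the file is always up to date.
        with open("tweets.csv", "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow(
                [status.id_str, status.user.screen_name, status.created_at, status.text]
            )

    def on_error(self, status_code):
        return status_code != 420  # stop on rate-limit disconnects

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, CsvStreamListener())
stream.filter(follow=USER_IDS)  # blocks and streams until interrupted
```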
I have a rather simple data-scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this website, which gives the desired distance (in both km and miles). For each of these 500 cities, I'd like to read the name from the Excel file, enter it in the "to" box, …
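Instead of scripting the website's form, a swapped-in alternative is to geocode each city and compute the distance directly. A minimal sketch assuming pandas, openpyxl, and geopy are installed, the Excel file has a "city" column, and the file names are placeholders:

```python
import time
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

geolocator = Nominatim(user_agent="city-distance-example")
montreal = geolocator.geocode("Montreal, Canada")
origin = (montreal.latitude, montreal.longitude)

df = pd.read_excel("cities.xlsx")
distances_km = []
for name in df["city"]:
    loc = geolocator.geocode(name)
    distances_km.append(geodesic(origin, (loc.latitude, loc.longitude)).km if loc else None)
    time.sleep(1)  # Nominatim's usage policy asks for at most 1 request/second

df["distance_km"] = distances_km
df.to_excel("cities_with_distances.xlsx", index=False)
```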
How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats. I have a list of 1000 companies I want to get information about. I want a bot to search for each company from my list in the search bar, open the specific company's info window, and extract a certain company code that exists on each page for each …
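A browser-automation tool such as Selenium fits this: type each name into the search bar, open the result, and read the code off the page. A minimal sketch; the URL and all CSS selectors are hypothetical, since they depend entirely on the target site's markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
companies = ["Acme Corp", "Globex"]  # your list of 1000 names

results = {}
for name in companies:
    driver.get("https://example.com")  # hypothetical search page
    box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "input.search")))
    box.clear()
    box.send_keys(name, Keys.RETURN)
    # Open the first hit, then read the code element (both selectors hypothetical).
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".result a"))).click()
    code = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".company-code")))
    results[name] = code.text

driver.quit()
```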
In addition to our list of publicly available datasets, I'd like to know if there is any list of publicly available news datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the available data were listed. Such information should include, but is not limited to: the name of the news network / news aggregator; what kind of news information it provides (title, snippet, full article, date, author, URL, ...); whether it allows for …
As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the available data were listed. Such information should include, but is not limited to: the name of the social network; what kind of user information it provides (posts, profile, friendship network, ...); whether it allows for crawling its …
I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately, the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd like to get data on all employees of a given company, which you can do manually on the site but which is not possible through the API. import.io would be perfect if it recognised the LinkedIn pagination (see end …
There are many simple plagiarism detection algorithms that work on top of search engines like Google. I want an index of a corpus of the whole internet to serve as the back-end database for my plagiarism detection software. What should the approach to building such a database be? Are there any open-source or collaborative live repositories? Somewhere I read that, instead of keeping a local database of the entire internet, one can index it and use the index for faster search. I know Elasticsearch …
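A minimal sketch of that "index it, don't mirror it" idea using Elasticsearch (official Python client, 8.x style) over crawled text, e.g. from Common Crawl. A real plagiarism pipeline would index n-gram shingles or minhashes rather than raw text, but the shape is the same:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

# Index extracted page text rather than storing raw pages.
es.index(index="corpus", id="doc-1", document={
    "url": "https://example.com/page",
    "text": "Full extracted text of the page goes here...",
})

# For a suspect passage, a match query returns the closest indexed documents.
hits = es.search(index="corpus", query={"match": {"text": "suspect passage to check"}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["url"])
```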
I want to know if there is any way I can crawl customer reviews for particular products from Amazon without being blocked. At the moment, my crawler is blocked after a few requests. Any ideas would be appreciated.
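There is no guaranteed way; the usual mitigations are to send browser-like headers, randomize delays, and back off on failures (and to check Amazon's terms and robots.txt first). A minimal sketch with illustrative header values and a placeholder URL:

```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

session = requests.Session()

def fetch(url, max_retries=3):
    for attempt in range(max_retries):
        resp = session.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        if resp.status_code == 200:
            return resp.text
        time.sleep(2 ** attempt * 5)  # exponential backoff on 503/captcha pages
    return None

html = fetch("https://www.amazon.com/product-reviews/B000EXAMPLE")
time.sleep(random.uniform(3, 8))  # pause between pages to stay under rate limits
```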
For an upcoming project, I'm mining textual posts from an online forum, using Scrapy. What is the best way to store this text data? I'm thinking of simply exporting it into a JSON file, but is there a better format? Or does it not matter?
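JSON Lines (one JSON object per line, often .jl) is usually a better fit than a single JSON array here: Scrapy appends items as they are scraped, a crash doesn't leave a truncated array, and you can stream the file later without loading it all into memory. Scrapy supports it natively; the spider and file names below are placeholders:

```python
# From the command line, the .jl extension selects the JSON Lines exporter:
#   scrapy crawl forum_spider -o posts.jl
# Or equivalently in settings.py (Scrapy 2.1+ FEEDS syntax):
FEEDS = {
    "posts.jl": {"format": "jsonlines", "encoding": "utf-8"},
}
```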
I'm going to train an ML algorithm to qualify potential sales leads based on company descriptions. To do this, I need to find the company descriptions programmatically, i.e., given a long list of company names, find descriptions for those companies. Here are my current techniques, which work OK but not great: using the Google Search API to fetch a summary from the results page (Google normally gives the company website as the first result when searching by company …
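For reference, a minimal sketch of that search-API step using the Google Custom Search JSON API; API_KEY and CX are placeholders you would create in the Google developer console, and the first result's snippet serves as a rough description:

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder

def company_snippet(name):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": name},
    )
    items = resp.json().get("items", [])
    # The first result is usually the company site; its snippet is a rough description.
    return items[0]["snippet"] if items else None

print(company_snippet("Acme Corporation"))
```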
I am searching a scientific database for abstracts of papers containing the words "project management". Here is the link: To get an abstract, I need to click on a paper, which opens a new page. How can I do that for 68 papers? I program in R and bash.
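The question mentions R, but the two-step pattern is the same in any language: collect each paper's link from the results page, then fetch each linked page and pull out the abstract. A minimal Python sketch with a hypothetical URL and selectors (in R, rvest's read_html()/html_elements() play the same roles):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://example.org/search?q=project+management"  # placeholder

results = BeautifulSoup(requests.get(BASE).text, "html.parser")
for link in results.select("a.paper-title"):               # hypothetical selector
    paper_url = urljoin(BASE, link["href"])
    paper = BeautifulSoup(requests.get(paper_url).text, "html.parser")
    abstract = paper.select_one("div.abstract")             # hypothetical selector
    print(paper_url, abstract.get_text(strip=True) if abstract else "no abstract")
```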
I want to scrape some data from a website. I have used import.io but am still not satisfied with it. Can any of you suggest something? What's the best tool for getting unstructured data from the web?