It seems like a lot of noteworthy AI tools are trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora (Facebook Translate, GPT-3). In general, an automatic, universal way of generating a dataset seems preferable. Is there any widely used web crawler that does essentially the same thing as Common Crawl but takes a parameter for the language sought? In other words, can one generate a web-crawled dataset in language X? (Background: I’d like to create …
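To my knowledge there is no crawler that takes "language sought" as a parameter, but Common Crawl annotates detected content languages in its index, and you can also filter the crawl text yourself. A minimal sketch of the do-it-yourself route, assuming warcio and langdetect are installed and a WET segment has been downloaded locally (the file name is a placeholder):

```python
# Keep only the WET (plain-text) records detected as the target language.
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

TARGET_LANG = "de"  # "language X" as an ISO 639-1 code

def records_in_language(wet_path, lang=TARGET_LANG):
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            try:
                if detect(text[:2000]) == lang:  # classify a prefix for speed
                    yield record.rec_headers.get_header("WARC-Target-URI"), text
            except LangDetectException:
                continue  # record too short or odd to classify

for url, text in records_in_language("CC-MAIN-example.warc.wet.gz"):
    print(url, len(text))
```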
I am trying to build a scraper that will run continuously and save tweets from a list of users instantaneously, or within seconds of a user posting. It could append the tweet details to a continuously updated CSV file.
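A minimal sketch of one way to get near-real-time tweets, using Tweepy's streaming interface (the v3.x StreamListener API; Tweepy 4.x reworked this into StreamingClient). The credentials and user IDs are placeholders; each incoming tweet is appended to the CSV as it arrives:

```python
import csv
import tweepy

USER_IDS = ["783214", "17874544"]  # numeric IDs of the accounts to follow (placeholders)

class CsvStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append one row per tweet so the file is always up to date.
        with open("tweets.csv", "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow(
                [status.id_str, status.user.screen_name, status.created_at, status.text]
            )

    def on_error(self, status_code):
        return status_code != 420  # stop on rate-limit disconnects

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, CsvStreamListener())
stream.filter(follow=USER_IDS)  # blocks and streams until interrupted
```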
I have a rather simple data-scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this website, which gives the desired distance (in both km and miles). For each of these 500 cities, I'd like to read the name from the Excel file, enter it in the "to" box, …
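Instead of scripting the website's form, a swapped-in alternative is to geocode each city and compute the distance directly. A minimal sketch assuming pandas, openpyxl, and geopy are installed, the Excel file has a "city" column, and the file names are placeholders:

```python
import time
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

geolocator = Nominatim(user_agent="city-distance-example")
montreal = geolocator.geocode("Montreal, Canada")
origin = (montreal.latitude, montreal.longitude)

df = pd.read_excel("cities.xlsx")
distances_km = []
for name in df["city"]:
    loc = geolocator.geocode(name)
    distances_km.append(geodesic(origin, (loc.latitude, loc.longitude)).km if loc else None)
    time.sleep(1)  # Nominatim's usage policy asks for at most 1 request/second

df["distance_km"] = distances_km
df.to_excel("cities_with_distances.xlsx", index=False)
```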
How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats. I have a list of 1000 companies I want to get information about. I want a bot to search for each company from my list in the search bar, open the specific company's info window, and extract a certain company code that exists on each page for each …
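A browser-automation tool such as Selenium fits this: type each name into the search bar, open the result, and read the code off the page. A minimal sketch; the URL and all CSS selectors are hypothetical, since they depend entirely on the target site's markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
companies = ["Acme Corp", "Globex"]  # your list of 1000 names

results = {}
for name in companies:
    driver.get("https://example.com")  # hypothetical search page
    box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "input.search")))
    box.clear()
    box.send_keys(name, Keys.RETURN)
    # Open the first hit, then read the code element (both selectors hypothetical).
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".result a"))).click()
    code = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".company-code")))
    results[name] = code.text

driver.quit()
```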
In addition to our list of publicly available datasets, I'd like to know if there is any list of publicly available news datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the available data were listed. Such information should include, but is not limited to: the name of the news network / news aggregator; what kind of news information it provides (title, snippet, full article, date, author, URL, ...); whether it allows for …
As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the available data were listed. Such information should include, but is not limited to: the name of the social network; what kind of user information it provides (posts, profile, friendship network, ...); whether it allows for crawling its …
I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately, the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd like to get data on all employees of a given company, which you can do manually on the site but which is not possible through the API. import.io would be perfect if it recognised the LinkedIn pagination (see end …
There are many simple plagiarism detection algorithms that work on top of search engines like Google. I want an index of a corpus of the whole internet to serve as the back-end database for my plagiarism detection software. What should the approach to building such a database be? Are there any open-source or collaborative live repositories? Somewhere I read that, instead of keeping a local database of the entire internet, one can index it and use the index for faster search. I know Elasticsearch …
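A minimal sketch of that "index it, don't mirror it" idea using Elasticsearch (official Python client, 8.x style) over crawled text, e.g. from Common Crawl. A real plagiarism pipeline would index n-gram shingles or minhashes rather than raw text, but the shape is the same:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

# Index extracted page text rather than storing raw pages.
es.index(index="corpus", id="doc-1", document={
    "url": "https://example.com/page",
    "text": "Full extracted text of the page goes here...",
})

# For a suspect passage, a match query returns the closest indexed documents.
hits = es.search(index="corpus", query={"match": {"text": "suspect passage to check"}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["url"])
```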
I want to know if there is any way I can crawl customer reviews for particular products from Amazon without being blocked. At the moment, my crawler is blocked after a few requests. Any ideas would be appreciated.
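There is no guaranteed way; the usual mitigations are to send browser-like headers, randomize delays, and back off on failures (and to check Amazon's terms and robots.txt first). A minimal sketch with illustrative header values and a placeholder URL:

```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

session = requests.Session()

def fetch(url, max_retries=3):
    for attempt in range(max_retries):
        resp = session.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        if resp.status_code == 200:
            return resp.text
        time.sleep(2 ** attempt * 5)  # exponential backoff on 503/captcha pages
    return None

html = fetch("https://www.amazon.com/product-reviews/B000EXAMPLE")
time.sleep(random.uniform(3, 8))  # pause between pages to stay under rate limits
```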
For an upcoming project, I'm mining textual posts from an online forum, using Scrapy. What is the best way to store this text data? I'm thinking of simply exporting it into a JSON file, but is there a better format? Or does it not matter?
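JSON Lines (one JSON object per line, often .jl) is usually a better fit than a single JSON array here: Scrapy appends items as they are scraped, a crash doesn't leave a truncated array, and you can stream the file later without loading it all into memory. Scrapy supports it natively; the spider and file names below are placeholders:

```python
# From the command line, the .jl extension selects the JSON Lines exporter:
#   scrapy crawl forum_spider -o posts.jl
# Or equivalently in settings.py (Scrapy 2.1+ FEEDS syntax):
FEEDS = {
    "posts.jl": {"format": "jsonlines", "encoding": "utf-8"},
}
```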
I'm going to train an ML algorithm to qualify potential sales leads based on company descriptions. To do this, I need to find the company descriptions programmatically, i.e., given a long list of company names, find descriptions for those companies. Here are my current techniques, which work OK but not great: using the Google Search API to fetch a summary from the results page (Google normally gives the company website as the first result when searching by company …
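For reference, a minimal sketch of that search-API step using the Google Custom Search JSON API; API_KEY and CX are placeholders you would create in the Google developer console, and the first result's snippet serves as a rough description:

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder

def company_snippet(name):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": name},
    )
    items = resp.json().get("items", [])
    # The first result is usually the company site; its snippet is a rough description.
    return items[0]["snippet"] if items else None

print(company_snippet("Acme Corporation"))
```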
I am searching a scientific database for abstracts of papers containing the words "project management". Here is the link: To get an abstract, I need to click on a paper, which opens a new page. How can I do that for 68 papers? I program in R and bash.
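The question mentions R, but the two-step pattern is the same in any language: collect each paper's link from the results page, then fetch each linked page and pull out the abstract. A minimal Python sketch with a hypothetical URL and selectors (in R, rvest's read_html()/html_elements() play the same roles):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://example.org/search?q=project+management"  # placeholder

results = BeautifulSoup(requests.get(BASE).text, "html.parser")
for link in results.select("a.paper-title"):               # hypothetical selector
    paper_url = urljoin(BASE, link["href"])
    paper = BeautifulSoup(requests.get(paper_url).text, "html.parser")
    abstract = paper.select_one("div.abstract")             # hypothetical selector
    print(paper_url, abstract.get_text(strip=True) if abstract else "no abstract")
```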
I want to scrape some data from a website. I have used import.io but am still not satisfied with it. Can any of you suggest something? What's the best tool for getting unstructured data from the web?