Crawling customer reviews from Amazon

Is there any way to crawl customer reviews for particular products from Amazon without being blocked? At the moment, my crawler gets blocked after only a few requests. Any ideas would be appreciated.


Based on my experience, you need at least 1 proxy for every 10 Amazon requests. So if you want to crawl 1000 products, you will need 100 proxies to be on the safe side.

In the past I tried several services, such as luminati.io and proxyrack.com. The problem with both is that their proxies end up being blocked and you have to get new ones, which becomes very expensive.

So I tried proxycrawl.com, which charges based on consumption rather than on the number of proxies. That worked much better, as I no longer have to care about how many proxies I have; I just load the Amazon comments.

So to be clear: if you want to use your own proxies, budget 1 proxy for every 10 products; otherwise, look for a company that can handle all of that for you.
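To make the arithmetic above concrete, here is a minimal Python sketch of the "1 proxy per 10 requests" rule of thumb. The ratio is this answer's own estimate, not a documented Amazon limit, and the function names `proxies_needed` and `assign_proxies`, as well as the example URLs and proxy labels, are hypothetical. The code only plans which proxy each request would use; it does not send anything.

```python
import math

# Rule of thumb from the answer above (an estimate, not a documented limit).
REQUESTS_PER_PROXY = 10


def proxies_needed(total_requests, per_proxy=REQUESTS_PER_PROXY):
    """Return how many proxies the rule of thumb calls for."""
    return math.ceil(total_requests / per_proxy)


def assign_proxies(urls, proxies, per_proxy=REQUESTS_PER_PROXY):
    """Yield (url, proxy) pairs, switching proxy every `per_proxy` requests.

    If there are fewer proxies than the rule calls for, the list wraps
    around and earlier proxies are reused.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    for index, url in enumerate(urls):
        yield url, proxies[(index // per_proxy) % len(proxies)]


if __name__ == "__main__":
    print(proxies_needed(1000))  # 100, matching the estimate above

    # Hypothetical placeholders, not real endpoints.
    urls = [f"https://example.com/product/{i}" for i in range(25)]
    proxies = ["proxy-a", "proxy-b", "proxy-c"]
    for url, proxy in assign_proxies(urls, proxies):
        print(proxy, url)
```

With 25 URLs and three proxies, the first ten requests are planned for `proxy-a`, the next ten for `proxy-b`, and the last five for `proxy-c`.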


Amazon detects a scraper by its fast, regular actions coming from the same IP address. Scraping automation tools can normally avoid the block by rotating IPs and slowing down their actions. Our product, Octoparse Cloud Extraction, addresses this with hundreds of IPs and by splitting the crawler's actions across different servers.


You are getting blocked because site operators do not want to spend server bandwidth on a client that consumes it without bringing in any significant revenue.

Try to make your crawling less predictable.

Slow down the frequency with which you ping the server and vary the actions of your crawler. This will make it harder to detect as it will act less predictably and may be wrongfully identified as being a very quick human.
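As a rough illustration of the advice above, the following Python sketch visits pages in a shuffled order and waits a random interval between requests. The 5 to 8 second range is an arbitrary assumption, not a known safe value, and `jittered_delay`, `crawl_slowly`, and the `fetch` callback are hypothetical names; `fetch` stands in for whatever function actually downloads a page.

```python
import random
import time


def jittered_delay(base_seconds=5.0, jitter_seconds=3.0, rng=random):
    """Return a delay drawn uniformly from [base, base + jitter] seconds."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


def crawl_slowly(urls, fetch, sleep=time.sleep, rng=random):
    """Call `fetch` on every URL in shuffled order, pausing between calls.

    `fetch` is supplied by the caller (hypothetical here). `sleep` and
    `rng` are injectable so the timing can be checked without waiting.
    Returns a list of (url, result) pairs in the order they were visited.
    """
    order = list(urls)
    rng.shuffle(order)  # avoid a fixed, predictable visiting order
    results = []
    for index, url in enumerate(order):
        if index > 0:
            sleep(jittered_delay(rng=rng))  # irregular gap between requests
        results.append((url, fetch(url)))
    return results


if __name__ == "__main__":
    # Dry run: print the planned pauses instead of actually sleeping.
    pages = ["page-1", "page-2", "page-3"]
    visited = crawl_slowly(
        pages,
        fetch=lambda url: f"fetched {url}",
        sleep=lambda seconds: print(f"would wait {seconds:.1f}s"),
    )
    print(visited)
```

Keeping `sleep` and `rng` as parameters is only a convenience for testing; in a real run the defaults (`time.sleep` and the `random` module) apply.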
