LinkedIn web scraping

I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd like to get data on all employees of a given company, which you can do manually on the site but is not possible through the API.

import.io would be perfect if it recognised the LinkedIn pagination (see end of page).

Does anyone know any web scraping tools or techniques applicable to the current format of the LinkedIn site, or ways of bending the API to carry out more flexible analysis? Preferably in R or web based, but certainly open to other approaches.

Topic scraping crawling social-network-analysis data-mining

Category Data Science


Beautiful Soup is designed specifically for parsing and scraping HTML, but it is a Python library, not an R one.


lxml is a nice web scraping library for Python. Beautiful Soup can use lxml as its parser, and Scrapy's selectors are also built on top of it, so using lxml directly is generally faster than either and, in my experience, has a much easier learning curve.

This is an example of a scraper which I built with it for a personal project, which can iterate over web pages.
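To give a feel for what parsing with lxml looks like, here is a minimal sketch. It is not the scraper mentioned above: the HTML snippet, the class names (`person`, `title`, `next`) and the function `parse_people` are all made up for illustration, and a real page would need its own XPath expressions.

```python
from lxml import html

# Hypothetical markup standing in for a downloaded page (an assumption,
# not LinkedIn's real structure).
PAGE = """
<html><body>
  <ul id="people">
    <li class="person"><a href="/in/alice">Alice</a> <span class="title">Engineer</span></li>
    <li class="person"><a href="/in/bob">Bob</a> <span class="title">Analyst</span></li>
  </ul>
  <a class="next" href="/people?page=2">Next</a>
</body></html>
"""


def parse_people(page_source):
    """Return (list of person dicts, href of the next page or None)."""
    tree = html.fromstring(page_source)
    people = []
    for item in tree.xpath('//li[@class="person"]'):
        people.append({
            "name": item.xpath('./a/text()')[0].strip(),
            "url": item.xpath('./a/@href')[0],
            "title": item.xpath('./span[@class="title"]/text()')[0].strip(),
        })
    # The "next" link is what lets a scraper iterate over several pages.
    next_links = tree.xpath('//a[@class="next"]/@href')
    return people, (next_links[0] if next_links else None)


if __name__ == "__main__":
    people, next_url = parse_people(PAGE)
    print(people, next_url)
```

To iterate over pages, you would fetch the returned next-page URL, call `parse_people` again, and stop when it returns `None`.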


I would also go with Beautiful Soup if you know Python. If you would rather code in JavaScript/jQuery (and you are familiar with Node.js), you may want to check out CoffeeScript (see its tutorial). I have already used it successfully on several occasions for scraping web pages.


Scrapy is a great Python framework that can help you scrape sites faster and structure your code better. Not all sites can be parsed with classic tools, because some build their content dynamically with JavaScript. For those sites it is better to use Selenium (a testing framework for websites, but also a great web scraping tool). Python bindings are available for it. A quick search will turn up a few tricks for using Selenium inside Scrapy, which keeps your code clear and organized and still lets you use Scrapy's tooling.

I think Selenium would be a better scraper for LinkedIn than classic tools, since the site uses a lot of JavaScript and dynamic content. Also, if you want to log in to your account and scrape all the content available to you, you will run into a lot of problems authenticating with simple libraries like requests or urllib.


I like rvest in combination with the SelectorGadget Chrome plug-in for selecting the relevant sections of a page.

I've used rvest to build small scripts that paginate through forums by:

  1. Looking for the "Page n Of m" object
  2. Extracting m
  3. Building a list of links from 1 to m, based on the page structure (e.g. www.sample.com/page1)
  4. Iterating the scraper through the full list of links
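
The steps above can be sketched roughly as follows. This version is in Python rather than rvest, to match the Python tools suggested in the other answers, and uses only the standard library. The function names, the regular expression and the URL pattern are assumptions for illustration; `fetch` and `parse` are placeholders you would supply for the actual download and extraction.

```python
import re

# Steps 1 and 2: find the "Page n Of m" marker and capture m.
PAGE_COUNT = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def total_pages(page_text):
    """Return m from the first 'Page n of m' marker in the text."""
    match = PAGE_COUNT.search(page_text)
    if match is None:
        raise ValueError("no 'Page n of m' marker found")
    return int(match.group(2))


def page_urls(base_url, last_page):
    """Step 3: build the links for pages 1..m (assumes a numeric suffix)."""
    return [f"{base_url}{n}" for n in range(1, last_page + 1)]


def scrape_all(first_page_text, base_url, fetch, parse):
    """Step 4: run the scraper over every page link.

    fetch(url) -> page text and parse(text) -> result are caller-supplied.
    """
    last_page = total_pages(first_page_text)
    return [parse(fetch(url)) for url in page_urls(base_url, last_page)]
```

For example, `page_urls("www.sample.com/page", total_pages(first_page))` gives the full list of links once the first page has been downloaded. Note that this only works when the site exposes the page count and a predictable URL pattern.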
