Web Scraping - a scientific database

I am searching a scientific database for abstracts of papers containing the words project management. Here is the link:

To get the abstracts, I need to click on each paper, which opens a new page. How can I do that for all 68 papers? I program in R and bash.

Topic scraping crawling r

Category Data Science


Another workaround is to get the listing by POST requests using curl in bash.

You can get the curl POST command from Firebug (F12 in Firefox): open the Network tab, filter for XHR requests, and copy the last request, the one for SearchPaper.aspx?str=project+management (right-click -> "Copy as cURL").

In this POST request, set the parameter ctl00$ContentPlaceHolder1$txtPageNo to the desired page number (1-6 in this case).
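A rough sketch of that pagination loop in bash follows. It assumes the POST body copied from the browser has been saved, still URL-encoded, to a file (post_body.txt is a made-up name) and that "$" appears in it as %24. The output file names and the 2-second pause are illustrative choices, and any headers or cookies from the copied command may need to be added to the curl call.

```shell
# Sketch only. Assumptions: post_body.txt (hypothetical name) holds the
# URL-encoded POST body copied from the browser, in which "$" appears as %24.

url='http://en.journals.sid.ir/SearchPaper.aspx?str=project+management'
field='ctl00%24ContentPlaceHolder1%24txtPageNo'

# set_page PAGE BODY: print BODY with the page-number field set to PAGE.
set_page() {
  printf '%s\n' "$2" | sed "s/${field}=[0-9]*/${field}=$1/"
}

# fetch_pages FILE: replay the POST for pages 1-6 and save each response.
fetch_pages() {
  body=$(cat "$1") || return 1
  for page in 1 2 3 4 5 6; do
    curl -s --data "$(set_page "$page" "$body")" -o "page_${page}.html" "$url"
    sleep 2   # pause between requests to go easy on the server
  done
}

# Usage: fetch_pages post_body.txt
```

Nothing is sent until fetch_pages is called, so set_page can be tried on its own first to check that the substitution hits the right field.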

Then pass the output to a static XML parsing tool to extract your data.
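For the parsing step, one crude option that stays in bash is to pull the href values out of a saved results page with grep and sed. This is only a stand-in for a proper parser and assumes double-quoted href attributes. The pattern that picks out the paper links is not given in this answer, so it is left as an argument, and the file name in the usage line is hypothetical.

```shell
# Sketch only: a grep/sed stand-in for a real HTML/XML parser.
# extract_links FILE PATTERN: print the unique href values in FILE that
# match the extended regex PATTERN (assumes href="..." with double quotes).
extract_links() {
  grep -o 'href="[^"]*"' "$1" \
    | sed 's/^href="//; s/"$//' \
    | grep -E "$2" \
    | sort -u
}

# Usage (file name and pattern are placeholders):
#   extract_links results_page.html 'PATTERN_FOR_PAPER_LINKS'
```

Each link printed this way could then be fetched with curl to get the page that holds the abstract.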


Try RSelenium with PhantomJS, since the data is requested and filled in by AJAX calls, so static web scraping tools won't work.

I managed to get the list on the first page.

http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html

Sample of what I managed to pull:

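# Clear objects left over from an earlier session (warns if they do not exist)
remove(mopub, m, run, rx, x, first1)
library(RSelenium)

# Start PhantomJS; adjust the path to your local phantomjs executable
pjs <- phantom(pjs_cmd = "C:/Users/bhavin.patel/Downloads/phantomjs-2.0.0-windows/bin/phantomjs.exe")
Sys.sleep(5)  # give PhantomJS time to start

remDr <- remoteDriver(browserName = 'PhantomJS')
dsurl <- "http://en.journals.sid.ir/SearchPaper.aspx?str=project%20management"
remDr$open()
remDr$navigate(dsurl)

# Each search result sits in an element with id "Table3"; pull its text
allt3 <- remDr$findElements('id', 'Table3')
lapply(allt3, function(dst) dst$getElementText())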

[[1]]
[[1]][[1]]
[1] " 1 :   EFFECTIVE FACTORS ON RURAL PEOPLE’S NON-PARTICIPATION OF MAHABAD’S DAM CATCHMENT IN WATERSHED MANAGEMENT PROJECTS\nAuthor(s): RASOULIAZAR SOLEIMAN*,FEALY SAEID\nJournal: INTERNATIONAL JOURNAL OF AGRICULTURAL MANAGEMENT AND DEVELOPMENT (IJAMAD)\nNumber: MARCH 2015 , Volume  5 , Number  1 ; Page(s) 19 To 26.\nKeyword(s): NON-PARTICIPATION, CATCHMENT, WATERSHED MANAGEMENT, MAHABAD TOWNSHIP, IRAN\nReference(s):  (0)      Citation(s):  (0) FullText:"
