I have 4 job titles, and for each of them I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False indicator of whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
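One standard way to check this, sketched minimally below (assuming the data sits in a pandas DataFrame with a job_title column and a boolean mentions_skill column; those names and the toy rows are illustrative), is a chi-square test of independence on the title-by-skill contingency table:

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Toy data standing in for the real scrape: one row per job description
    df = pd.DataFrame({
        "job_title": ["analyst", "analyst", "engineer", "engineer", "scientist", "scientist"],
        "mentions_skill": [True, False, True, True, False, True],
    })

    # Contingency table: counts of True/False per job title
    table = pd.crosstab(df["job_title"], df["mentions_skill"])

    # A small p-value suggests the share of descriptions mentioning the skill differs by title
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(table)
    print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")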
https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx This is the website I want to scrape. For example, I input 500 4TH AVE and I just want to get the Parcel Number and Lot Size. I tried editing the link to https://blue.kingcounty.com/Assessor/eRealProperty/search=5004THAVE.aspx, but it did not work. Could someone please demonstrate how to do this? Thank you.
I am a data science student working on my capstone research project. I am trying to collect posts from a number of public organization pages on Facebook. I am looking at the Graph API, and it does not appear to have an endpoint for this use case. The page/feed endpoint requires either moderation access to the pages (which I do not have) or public page read access, which requires a business application. The CrowdTangle process is not accepting researchers outside of very …
I have this HTML code that repeats multiple times:

    <div class="Company_line-logo image-loader-target" data-image-loader-height="47" data-image-loader-height-mobile="47" data-image-loader-src="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png" data-image-loader-src-mobile="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo_mobile.png" data-image-loader-width="189" data-image-loader-width-mobile="189" style='background-image:url("http://en.50partners.fr/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png");'></div>
    <h5 class="Company_line-title">Woleet</h5>
    <div class="Company_line-description">

By using:

    for blocks in soup:
        block = soup.find('a', class_='Company_line logo-contains-name').find('h5').get_text()

I can get what I want, that is, "Woleet" between the h5 tags. I tried to iterate this to get all of them:

    block = soup.find('a', class_='Company_line logo-contains-name')
    for name in block:
        names = block.find_all('h5')

but it returns only one <h5 class="Company_line-title">Woleet</h5>, whereas I should …
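For reference, a minimal BeautifulSoup sketch of the usual fix (assuming every company block is wrapped in an <a class="Company_line logo-contains-name"> as in the working find above; the URL is a placeholder): find returns only the first match, so iterate over find_all instead:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL for the page being scraped
    html = requests.get("http://en.50partners.fr/").text
    soup = BeautifulSoup(html, "html.parser")

    names = []
    # find_all returns every matching anchor, not just the first one
    for block in soup.find_all("a", class_="Company_line logo-contains-name"):
        h5 = block.find("h5", class_="Company_line-title")
        if h5 is not None:
            names.append(h5.get_text(strip=True))

    print(names)  # e.g. ['Woleet', ...]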
I am trying to build a scraper that will run continuously and save tweets from a list of users instantaneously, or within seconds of a user tweeting. It could save the tweet details to a continuously updated CSV file.
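One possible starting point, sketched with tweepy's streaming client (this assumes a Twitter API v2 bearer token; the token and the usernames in the rule are placeholders):

    import csv
    import tweepy

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder

    class CsvStream(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            # Append each incoming tweet to a running CSV file
            with open("tweets.csv", "a", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow([tweet.id, tweet.created_at, tweet.text])

    stream = CsvStream(BEARER_TOKEN)
    # One filter rule covering the accounts of interest
    stream.add_rules(tweepy.StreamRule("from:user_one OR from:user_two"))
    stream.filter(tweet_fields=["created_at"])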
I want to scrape images from several Facebook groups; some of them are public and some are not. I am new to web scraping, but I tried to look for solutions with Selenium, BeautifulSoup, or Scrapy and I didn't find anything.
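For the public groups, a bare-bones Selenium sketch (the group URL is a placeholder, and this only sees images already rendered on the page; posts loaded by infinite scroll need extra scrolling logic):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    GROUP_URL = "https://www.facebook.com/groups/example"  # placeholder

    driver = webdriver.Chrome()
    driver.get(GROUP_URL)

    # Collect the src attribute of every <img> currently on the page
    image_urls = [img.get_attribute("src")
                  for img in driver.find_elements(By.TAG_NAME, "img")]
    print(image_urls)
    driver.quit()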
I am an aspiring Data Analyst and I was wondering how to web scrape website usage/user data such as: total people who visited the website, total visitors per page, how long each visitor stayed on each page, etc. The website I am interested in scraping does not have an API, in case that affects your answer. I have searched online, watched many YouTube tutorials, etc., but nothing pertains to the data I need. I don't know the …
I have a rather simple data scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this website which gives the desired distance (in both km and miles). For each of these 500 cities, I'd like to read the name in the Excel file, enter it in the "to" box, …
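If driving that website turns out to be awkward, one alternative sketch is to skip the site entirely and compute the distances with geopy (the file and column names below are hypothetical):

    import pandas as pd
    from geopy.geocoders import Nominatim
    from geopy.distance import geodesic

    # Hypothetical file/column names: a single column of city names
    cities = pd.read_excel("cities.xlsx")["city"]

    geolocator = Nominatim(user_agent="distance-script")
    montreal = geolocator.geocode("Montreal, Canada")
    origin = (montreal.latitude, montreal.longitude)

    distances = []
    for name in cities:
        loc = geolocator.geocode(name)   # Nominatim expects roughly 1 request/second
        if loc is None:
            distances.append(None)       # city not found
            continue
        distances.append(geodesic(origin, (loc.latitude, loc.longitude)).km)

    pd.DataFrame({"city": cities, "distance_km": distances}).to_csv("distances.csv", index=False)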
I have been exploring the problem of tagging or clustering websites by their business domain using only the website URL. For example: amazon.com => e-commerce, bbc.co.uk => news, Adidas.com => sports apparel. I have read through some research papers which try to cluster websites using different unsupervised clustering algorithms, like CLUE (link here). One way to think about it is to create a repository of labeled websites and then create a model to tag similar websites using this …
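As a rough illustration of the repository idea, a minimal sketch that scrapes each homepage's visible text and clusters the sites with TF-IDF + k-means (the URLs are the examples from the question; a real pipeline would need crawling etiquette, error handling, and far more sites):

    import requests
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    urls = ["https://www.amazon.com", "https://www.bbc.co.uk", "https://www.adidas.com"]

    def homepage_text(url):
        # Visible text of the landing page
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        return soup.get_text(" ", strip=True)

    docs = [homepage_text(u) for u in urls]
    X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(urls, labels)))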
I've essentially been handed a dataset of website access history and I'm trying to draw some conclusions from it. The data supplied gives me the web URL, the datetime of when it was accessed, and the unique ID of the user accessing that data. This means that for a given user ID, I can see a timeline of how they went through the website and what pages they looked at. I'd quite like to try clustering these users into different …
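A minimal sketch of one way to start (the column names are assumptions for the log described above): build one feature vector per user from page-visit counts and run k-means on it:

    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical columns: user_id, url, datetime
    log = pd.read_csv("access_log.csv")

    # One row per user, one column per URL, values = visit counts
    features = pd.crosstab(log["user_id"], log["url"])

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
    clusters = pd.Series(labels, index=features.index, name="cluster")
    print(clusters.value_counts())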
I want to do some modelling and data visualization on historical stock data, including price, volume, financials, etc. Is there a public dataset available for stock price history? I looked at a few, but either they have a high cost or I am not sure they would be reliable. Free would be preferred, and also established and reliable. If not, what are some good options for collecting the data myself? Maybe web scraping, or public APIs, etc.
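For the collect-it-yourself route, a small sketch with the yfinance package (the tickers and dates are just examples):

    import yfinance as yf

    # Daily OHLCV history pulled from Yahoo Finance
    data = yf.download(["AAPL", "MSFT"], start="2015-01-01", end="2020-12-31")
    print(data["Close"].tail())

    # Per-ticker fundamentals-style fields are also exposed
    print(yf.Ticker("AAPL").info.get("marketCap"))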
I am trying to scrape the web to find information about a set of academic papers. Unfortunately, for many of the papers, I only have the author's name, a year, and part of a title. (For example, [BANDURA A, 1997, SELF EFFICACY EXERCI]) I have tried using a web driver to look up the papers on Scopus but many of the papers are not there. Does anyone know of a good academic paper database and/or plug-in that I could search …
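One option worth sketching is the free Crossref REST API, which accepts exactly this kind of partial author/year/title query (the query string below is built from the example reference in the question):

    import requests

    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": "Bandura 1997 self efficacy", "rows": 3},
        timeout=10,
    )
    # Print DOI and title of the top candidate matches
    for item in resp.json()["message"]["items"]:
        print(item.get("DOI"), item.get("title", ["?"])[0])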
I'm going to scrape the HTML code from a large number of URLs and store them on my computer for machine learning purposes (basically, I'm going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML code for all the web pages? I want to be able to see which URLs I have already scraped, so that I don't have to scrape them again, and for each piece …
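One simple layout, sketched under the assumption that a flat directory plus an index file is enough: save each page under a hash of its URL and keep a JSON-lines index of what has already been scraped:

    import hashlib
    import json
    import pathlib
    import requests

    OUT = pathlib.Path("pages")
    OUT.mkdir(exist_ok=True)
    INDEX = pathlib.Path("index.jsonl")   # one JSON record per scraped URL

    scraped = set()
    if INDEX.exists():
        scraped = {json.loads(line)["url"] for line in INDEX.open()}

    def save_page(url):
        if url in scraped:
            return                        # already scraped, skip
        html = requests.get(url, timeout=10).text
        name = hashlib.sha1(url.encode()).hexdigest() + ".html"
        (OUT / name).write_text(html, encoding="utf-8")
        with INDEX.open("a") as f:
            f.write(json.dumps({"url": url, "file": name}) + "\n")
        scraped.add(url)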
This should have an easy solution but I can't understand how to avoid it.

    content1 = soup.find('div', class_='fusion-text fusion-text-6')
    content2 = soup.find('div', class_='fusion-text fusion-text-7')
    for para in content1:
        comp_name = para.find_all('a')['href']
        print(comp_name)

The error comes because I am indexing ['href'] on the list returned by find_all. comp_name = para.find('a')['href'] doesn't return an error and gives the right output (a URL), but only the first one. Since I want to scrape all of them inside my content1, I wanted to use find_all …
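For reference, a minimal sketch of the usual pattern (using content1 and content2 from the snippet above): find_all returns a list of <a> tags, so the ['href'] lookup has to happen on each tag inside a loop, not on the list itself:

    for content in (content1, content2):
        if content is None:
            continue
        for a in content.find_all('a'):
            print(a.get('href'))   # .get avoids a KeyError on anchors without href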
I have this script:

    import requests
    from requests import get
    from bs4 import BeautifulSoup
    import csv
    import pandas as pd

    f = open('olanda.csv', 'wb')
    writer = csv.writer(f)
    url = ('https://www......')
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    type(soup)
    table = soup.find('table', id='tablepress-94').text.strip()
    print(table)
    writer.writerow(table.split(), delimiter = ',')
    f.close()

When it writes to the CSV file, it writes everything in a single cell, like this: Sno.,CompanyLocation,1KarifyNetherlands,2Umenz,Benelux,BVNetherlands,3TovertafelNetherlands,4Behandeling,BegrepenNetherlands,5MEXTRANetherlands,6Sleep.aiNetherlands,7OWiseNetherlands,8Healthy,WorkersNetherlands,9&thijs,|,thuis,in,jouw,situatieNetherlands,10HerculesNetherlands, etc. I wanted to have the output in a single column and each value (separated …
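A sketch of one way to get one table row per CSV row (still assuming the same 'tablepress-94' table and the url from the script above): iterate over the <tr>/<td> elements instead of splitting the raw text:

    import csv
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')   # url as in the script above
    table = soup.find('table', id='tablepress-94')

    with open('olanda.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tr in table.find_all('tr'):
            cells = [c.get_text(strip=True) for c in tr.find_all(['th', 'td'])]
            writer.writerow(cells)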
I have this code and I want to extract holidays, petrol prices and temperature, but I don't know where the problem is. I need your help as soon as possible, please. I want to add this extracted data to my dataset, which is based on a date column, comparing the scraped data with the dates that I have in my dataset. I also want to test the impact of each variable (holidays, temperature, ...).

    import requests
    import re
    import json
    import datefinder
    from googletrans import …
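For the holidays part specifically, a small sketch that sidesteps scraping by using the holidays package (the country and the 'date' column name are assumptions):

    import holidays
    import pandas as pd

    # Hypothetical dataset keyed by a 'date' column, as described above
    df = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=5, freq="D")})

    fr_holidays = holidays.France()   # swap in the relevant country
    df["is_holiday"] = df["date"].dt.date.apply(lambda d: d in fr_holidays)
    print(df)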
I recently started working as a data scientist and I am starting a web scraping and NLP project using Python. The idea is to create a program that searches for public information on the company's clients. This information can come from various sources: annual reports, income statements, articles, and so on. I will have to deal with two types of formats: HTML and PDF. For now I will focus on retrieving the revenue of the company. After a month of research and tests, …
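For the PDF side, a rough sketch with pdfminer.six (the file name is a placeholder, and the regex is only a crude first pass at spotting revenue figures):

    import re
    from pdfminer.high_level import extract_text

    text = extract_text("annual_report.pdf")   # hypothetical file

    # Lines mentioning 'revenue' with a number nearby; real reports need sturdier parsing
    for match in re.finditer(r"revenue[^\n]{0,80}?[\d][\d.,]*", text, re.IGNORECASE):
        print(match.group(0))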
I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape, so that pre-processing (via Python) later would be as simple as possible. My options are the following: dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 KB each, OR maintain one single large …
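A third option worth sketching (an assumption on my part, not something from the question): append each scrape as one JSON line to a single file, which keeps writes trivial and lets pandas read everything back in one call for pre-processing:

    import json
    import time

    def save_scrape(posts, path="scrapes.jsonl"):
        # One line per scrape: a timestamp plus the list of post dicts
        record = {"scraped_at": time.time(), "posts": posts}
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    # Later, for pre-processing:
    # import pandas as pd
    # df = pd.read_json("scrapes.jsonl", lines=True)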
I'm running web scrapes via Python which retrieve data from CSVs hosted on the web. I'd like to pass the data into an MSSQL database. An issue I have is the mixed elements/data types in the CSV. Here is an example of the data:

    Item  Val1  Val2
    A     100   200
    B     101   201
    C     Null  -2/2(%)
    D     Null  2019-Nov-18

I would like to import all of this data into the db, but the critical data is in the "Val2" column. …
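One way to sketch the load (the file name and connection string are placeholders): read everything as text with pandas and push it to MSSQL with Val2 stored as NVARCHAR, deferring any type conversion to SQL or a later step:

    import pandas as pd
    from sqlalchemy import create_engine, types

    # Read everything as text so mixed values like '-2/2(%)' and '2019-Nov-18' survive
    df = pd.read_csv("scraped.csv", dtype=str)

    engine = create_engine(
        "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
    )

    df.to_sql(
        "scraped_data",
        engine,
        if_exists="append",
        index=False,
        dtype={"Val2": types.NVARCHAR(length=100)},   # keep Val2 as text in MSSQL
    )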