Web Scraping Using Python Selenium and Beautiful Soup
I recently got involved in a project requiring web scraping for the purposes of product documentation, pricing, reviews, etc.
Below I am going to describe a method that is widely used for web scraping. The process below uses Selenium, a Python module, to retrieve web information from Walmart.
First we need to make sure the Python interpreter is installed.
If on Windows, open CMD, type python and hit enter. If Python is not installed, this should take you to the Microsoft Store; just click Get and install it (the button is usually in the top right corner). You can skip this step if Python is already installed.
If on Linux (Debian/Ubuntu), just run sudo apt-get install python3 python3-pip.
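To confirm the interpreter and pip are available, you can check their versions (depending on your system the commands may be python/pip instead of python3/pip3).
python3 --version
pip3 --version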
Next, let's install the required Python modules. Create a file named requirements.txt with the below contents.
requests>=2.28.1
selenium>=4.7.2
beautifulsoup4>=4.11.1
lxml>=4.9.2
To install the modules listed in requirements.txt, run the below.
pip install -r requirements.txt
We are almost ready for some action. Let's create our Python web scraper script.
Create a file with the below content.
#!/usr/bin/env python
import json
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
# url = 'https://www.walmart.com/ip/FUBU-Men-s-Zone-Basketball-High-top-Sneakers/439361264/'
url = 'https://www.walmart.com/ip/Hisense-58-Class-4K-UHD-LED-LCD-Roku-Smart-TV-HDR-R6-Series-58R6E3/587182688/'
# User agent for Linux
# 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
# Headers
headers = {
    'authority': 'www.walmart.com',
    'method': 'GET',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/json',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
# Convert Selenium's cookie list into a simple name/value dictionary usable by requests
def get_cookies(selenium_cookies):
    cookies = {}
    for cookie in selenium_cookies:
        cookies[cookie['name']] = cookie['value']
    return cookies
# Launch a headless Chrome session and load the product page so the site sets its cookies
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
selenium_cookies = driver.get_cookies()
cookies = get_cookies(selenium_cookies)
# print(cookies)
# driver.close()
driver.quit()
# Request the page again with requests, reusing the cookies collected by Selenium
r = requests.get(url, headers=headers, cookies=cookies)
# save the content to a file
# with open('walmart_data.html', 'w') as f:
# print(r.text, file=f)
soup = BeautifulSoup(r.text, 'lxml')
# The product details (including the price) are embedded in the page as JSON-LD structured data
data = json.loads(soup.find('script', type='application/ld+json').text)
# print(soup.prettify())
print('Price:', data['offers']['price'])
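Save the script and run it with Python (the file name below is just an example). Google Chrome needs to be installed on the machine; recent Selenium releases (4.6 and newer) download a matching chromedriver automatically, while older versions require installing chromedriver yourself.
python walmart_scraper.py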
The Python script uses Selenium with a single URL/item (this can also be randomized if required) to retrieve a workable cookie, which is checked by some web sites (like Walmart). You can then use a proxy service to avoid getting blocked while web crawling/scraping. I listed some of the proxy API vendors below.
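If you route the final request through a proxy, the requests library supports this via its proxies argument. Below is a minimal sketch; the proxy address and credentials are placeholders, and you would substitute the endpoint supplied by whichever proxy vendor you choose.
# Hypothetical proxy endpoint - replace with the address/credentials from your proxy vendor
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)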
Another option I found was using a service like the BlueCart API to retrieve Walmart data, but it might be quite a bit more expensive than the other choices.
Do you need help with this configuration? Just let us know.
Like this article? Please provide feedback, or let us know in the comments below.
Use the Contact Form to get in touch with us for a free consultation.