Web Scraping

Web scraping is the process of automatically extracting information from websites. In Python, a common approach is to use requests to download pages and BeautifulSoup (from the bs4 package) to parse them.

1. Installation

pip install requests beautifulsoup4

2. The Basic Scraper

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
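response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()  # stop early on 4xx/5xx responses

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)  # "Example Domain"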

# Find all links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
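The href values printed by the loop are often relative (for example /about), and some anchors have no href at all. One way to normalize them is sketched here, assuming a hypothetical helper named extract_links and urljoin from the standard library. It works on an HTML string, so it can be tried without making a network request:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute.

    extract_links is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # href=True skips anchors that have no href attribute at all
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative paths against the page's own URL
        urls.append(urljoin(base_url, a["href"]))
    return urls


# Made-up sample markup, not the real content of example.com
sample_html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://www.iana.org/domains/example">More information</a>
  <a name="no-href">Not a link</a>
</body></html>
"""

links = extract_links(sample_html, "https://example.com/index.html")
print(links)
# ['https://example.com/about', 'https://www.iana.org/domains/example']
```

In a real scraper, response.text and response.url from the request would take the place of sample_html and the hard-coded base URL.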

3. Ethical Scraping

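  • Check robots.txt: Read the site's robots.txt and skip any paths it disallows.
  • Don't Hammer Servers: Pause between requests, for example with time.sleep(1).
  • User-Agent: Send a User-Agent header that identifies your scraper, so the site operator knows who is making the requests.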
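The three points above can be combined in a few lines. This is a rough sketch rather than a complete crawler: the names USER_AGENT, is_allowed, and polite_get are hypothetical, the one-second delay is an arbitrary choice, and the robots.txt rules are passed in as a list of lines so the check can run offline.

```python
import time
from urllib.robotparser import RobotFileParser

import requests

# Hypothetical identifier; replace with your own project name and contact.
USER_AGENT = "my-scraper/0.1 (contact: you@example.com)"


def is_allowed(robots_lines, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def polite_get(url, session=None, delay=1.0):
    """Fetch a URL with an identifying User-Agent, then pause `delay` seconds."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay)  # spread requests out instead of sending them back to back
    return response


# Made-up rules for illustration, not the robots.txt of any real site
rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(rules, "https://example.com/index.html"))    # True
print(is_allowed(rules, "https://example.com/private/data"))  # False
```

Against a live site, RobotFileParser can download the rules itself with set_url() followed by read(), and polite_get would then be called only for URLs that is_allowed accepts.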

Pro Tip: For sites that render their content with JavaScript (such as React apps), requests only receives the initial HTML, so you might need a browser automation tool such as Selenium or Playwright.