Using Crawlee
In this guide you'll learn how to use the Crawlee library in your Apify Actors.
Introduction
Crawlee is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scrapers. It integrates seamlessly with the Apify platform and supports a variety of scraping techniques, from parsing static HTML to handling dynamic, JavaScript-rendered content. Crawlee offers a range of crawlers to suit different scraping needs: HTTP-based crawlers such as HttpCrawler, BeautifulSoupCrawler, and ParselCrawler, and browser-based crawlers such as PlaywrightCrawler.
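The most lightweight of these, HttpCrawler, only fetches pages over HTTP and leaves all parsing to you. Below is a minimal standalone sketch (run outside an Actor for brevity); the fields stored here are illustrative.

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

# Create a crawler that only performs HTTP requests, without any HTML parsing.
crawler = HttpCrawler(max_requests_per_crawl=10)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}...')

    # Store the URL and the HTTP status code; the raw response is available
    # on context.http_response if you want to parse it yourself.
    await context.push_data({
        'url': context.request.url,
        'status_code': context.http_response.status_code,
    })


async def main() -> None:
    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())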
In this guide, you'll learn how to use Crawlee with BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler to build Apify Actors for web scraping.
Actor with BeautifulSoupCrawler
The BeautifulSoupCrawler is ideal for extracting data from static HTML pages. It uses BeautifulSoup for parsing and ImpitHttpClient for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a great choice. Below is an example of how to use it in an Apify Actor.
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor

# Create a crawler.
crawler = BeautifulSoupCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
        'h1s': [h1.text for h1 in context.soup.find_all('h1')],
        'h2s': [h2.text for h2 in context.soup.find_all('h2')],
        'h3s': [h3.text for h3 in context.soup.find_all('h3')],
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
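The example above sends every page to a single default handler. If your crawl visits different kinds of pages, you can register additional handlers on the crawler's router and route links to them by label. A minimal sketch extending the BeautifulSoupCrawler example above (the 'DETAIL' label and the 'a.product' selector are illustrative placeholders; the same imports and crawler instance are assumed):

# In the default handler above, enqueue matching links under a label, e.g.:
#     await context.enqueue_links(selector='a.product', label='DETAIL')


@crawler.router.handler('DETAIL')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    Actor.log.info(f'Scraping detail page {context.request.url}...')

    # Requests enqueued with the 'DETAIL' label are processed by this handler.
    await context.push_data({
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    })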
Actor with ParselCrawler
The ParselCrawler works in the same way as BeautifulSoupCrawler, but it uses the Parsel library for HTML parsing, which allows for more powerful and flexible data extraction using XPath (as well as CSS) selectors. Because Parsel is built on lxml, it is typically faster than BeautifulSoupCrawler. Below is an example of how to use ParselCrawler in an Apify Actor.
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext

from apify import Actor

# Create a crawler.
crawler = ParselCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': context.selector.xpath('//title/text()').get(),
        'h1s': context.selector.xpath('//h1/text()').getall(),
        'h2s': context.selector.xpath('//h2/text()').getall(),
        'h3s': context.selector.xpath('//h3/text()').getall(),
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
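Parsel also supports CSS selectors, so you can mix them freely with XPath. For example, the extraction in the handler above could equivalently be written as follows (a sketch using Parsel's ::text pseudo-element):

    # CSS-selector equivalent of the XPath extraction above.
    data = {
        'url': context.request.url,
        'title': context.selector.css('title::text').get(),
        'h1s': context.selector.css('h1::text').getall(),
        'h2s': context.selector.css('h2::text').getall(),
        'h3s': context.selector.css('h3::text').getall(),
    }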
Actor with PlaywrightCrawler
The PlaywrightCrawler is built for handling dynamic web pages that rely on JavaScript to render their content. Using the Playwright library, it provides a browser-based automation environment for interacting with complex websites. Below is an example of how to use PlaywrightCrawler in an Apify Actor.
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor

# Create a crawler.
crawler = PlaywrightCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
    # Run the browser in headless mode.
    headless=True,
    # Pass additional launch options to the browser.
    browser_launch_options={'args': ['--disable-gpu']},
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': await context.page.title(),
        'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
        'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
        'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
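Because PlaywrightCrawler drives a real browser, the request handler can wait for JavaScript-rendered elements before extracting them. Below is a minimal variant of the handler above (the 'h1' selector and the 10-second timeout are illustrative; the same imports and crawler instance are assumed):

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Wait until the first <h1> has been rendered (up to 10 seconds)
    # before reading anything from the page.
    await context.page.wait_for_selector('h1', timeout=10_000)

    await context.push_data({
        'url': context.request.url,
        'title': await context.page.title(),
        'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
    })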
Conclusion
In this guide, you learned how to use the Crawlee library in your Apify Actors. With the BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler crawlers, you can efficiently scrape both static and dynamic web pages, making it easy to build web scrapers in Python. See the Actor templates to get started with your own scraping projects. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!