Using Crawlee
In this guide you'll learn how to use the Crawlee library in your Apify Actors.
Introduction
Crawlee is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scrapers. It integrates seamlessly with the Apify platform and supports a variety of scraping techniques, from parsing static HTML to handling dynamic, JavaScript-rendered content. Crawlee offers a range of crawlers to suit different scraping needs: HTTP-based crawlers such as HttpCrawler, BeautifulSoupCrawler, and ParselCrawler, and browser-based crawlers such as PlaywrightCrawler.
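The most lightweight of these, HttpCrawler, only fetches pages over HTTP and leaves all parsing to you. Below is a minimal standalone sketch (run outside an Actor for brevity); the fields stored here are illustrative.

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

# Create a crawler that only performs HTTP requests, without any HTML parsing.
crawler = HttpCrawler(max_requests_per_crawl=10)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}...')

    # Store the URL and the HTTP status code; the raw response is available
    # on context.http_response if you want to parse it yourself.
    await context.push_data({
        'url': context.request.url,
        'status_code': context.http_response.status_code,
    })


async def main() -> None:
    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())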
In this guide, you'll learn how to use Crawlee with BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler to build Apify Actors for web scraping.
Actor with BeautifulSoupCrawler
The BeautifulSoupCrawler is ideal for extracting data from static HTML pages. It uses BeautifulSoup for parsing and ImpitHttpClient for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a great choice. Below is an example of how to use it in an Apify Actor.
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor

# Create a crawler.
crawler = BeautifulSoupCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
        'h1s': [h1.text for h1 in context.soup.find_all('h1')],
        'h2s': [h2.text for h2 in context.soup.find_all('h2')],
        'h3s': [h3.text for h3 in context.soup.find_all('h3')],
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
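The example above sends every page to a single default handler. If your crawl visits different kinds of pages, you can register additional handlers on the crawler's router and route links to them by label. A minimal sketch extending the BeautifulSoupCrawler example above (the 'DETAIL' label and the 'a.product' selector are illustrative placeholders; the same imports and crawler instance are assumed):

# In the default handler above, enqueue matching links under a label, e.g.:
#     await context.enqueue_links(selector='a.product', label='DETAIL')


@crawler.router.handler('DETAIL')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    Actor.log.info(f'Scraping detail page {context.request.url}...')

    # Requests enqueued with the 'DETAIL' label are processed by this handler.
    await context.push_data({
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    })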
Actor with ParselCrawler
The ParselCrawler works in the same way as BeautifulSoupCrawler, but it uses the Parsel library for HTML parsing, which allows for more powerful and flexible data extraction using XPath (as well as CSS) selectors. Because Parsel is built on lxml, it is typically faster than BeautifulSoupCrawler. Below is an example of how to use ParselCrawler in an Apify Actor.
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext

from apify import Actor

# Create a crawler.
crawler = ParselCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': context.selector.xpath('//title/text()').get(),
        'h1s': context.selector.xpath('//h1/text()').getall(),
        'h2s': context.selector.xpath('//h2/text()').getall(),
        'h3s': context.selector.xpath('//h3/text()').getall(),
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
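Parsel also supports CSS selectors, so you can mix them freely with XPath. For example, the extraction in the handler above could equivalently be written as follows (a sketch using Parsel's ::text pseudo-element):

    # CSS-selector equivalent of the XPath extraction above.
    data = {
        'url': context.request.url,
        'title': context.selector.css('title::text').get(),
        'h1s': context.selector.css('h1::text').getall(),
        'h2s': context.selector.css('h2::text').getall(),
        'h3s': context.selector.css('h3::text').getall(),
    }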
Actor with PlaywrightCrawler
The PlaywrightCrawler is built for handling dynamic web pages that rely on JavaScript to render their content. Using the Playwright library, it provides a browser-based automation environment for interacting with complex websites. Below is an example of how to use PlaywrightCrawler in an Apify Actor.
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor

# Create a crawler.
crawler = PlaywrightCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=50,
    # Run the browser in headless mode.
    headless=True,
    # Pass additional launch options to the browser.
    browser_launch_options={'args': ['--disable-gpu']},
)


# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Extract the desired data.
    data = {
        'url': context.request.url,
        'title': await context.page.title(),
        'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
        'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
        'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
    }

    # Store the extracted data to the default dataset.
    await context.push_data(data)

    # Enqueue additional links found on the current page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
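Because PlaywrightCrawler drives a real browser, the request handler can wait for JavaScript-rendered elements before extracting them. Below is a minimal variant of the handler above (the 'h1' selector and the 10-second timeout are illustrative; the same imports and crawler instance are assumed):

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Wait until the first <h1> has been rendered (up to 10 seconds)
    # before reading anything from the page.
    await context.page.wait_for_selector('h1', timeout=10_000)

    await context.push_data({
        'url': context.request.url,
        'title': await context.page.title(),
        'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
    })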
Conclusion
In this guide, you learned how to use the Crawlee library in your Apify Actors. With the BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler crawlers, you can efficiently scrape both static and dynamic web pages, making it easy to build web scrapers in Python. See the Actor templates to get started with your own scraping projects. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!