Using Parsel with Impit
In this guide, you'll learn how to combine the Parsel and Impit libraries when building Apify Actors.
Introduction
Parsel is a Python library for extracting data from HTML and XML documents using CSS selectors and XPath expressions. It offers an intuitive API for navigating and extracting structured data, making it a popular choice for web scraping. Compared to BeautifulSoup, it also delivers better performance.
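To give a sense of the API, here is a minimal, standalone sketch that extracts data with both a CSS selector and an XPath expression (the HTML snippet is made up for illustration):

import parsel

html = '<html><body><h1>Hello</h1><a href="/next">Next page</a></body></html>'
selector = parsel.Selector(text=html)

# CSS selector: get the text of the first <h1> element.
print(selector.css('h1::text').get())        # Hello
# XPath expression: get the href attribute of every link.
print(selector.xpath('//a/@href').getall())  # ['/next']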
Impit is Apify's high-performance HTTP client for Python. It supports both synchronous and asynchronous workflows and is built for large-scale web scraping, where making thousands of requests efficiently is essential. With built-in browser impersonation and anti-blocking features, it simplifies handling modern websites.
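As a minimal sketch, fetching a single page with the asynchronous client looks like this. The client, the get call, and the text attribute mirror the Actor code later in this guide; the browser argument for impersonation is an assumption based on Impit's anti-blocking features rather than something the example below relies on:

import asyncio

import impit

async def fetch(url: str) -> str:
    # Create an asynchronous client; the browser argument (assumed here)
    # selects which browser Impit should impersonate.
    async with impit.AsyncClient(browser='firefox') as client:
        response = await client.get(url)
        return response.text

print(asyncio.run(fetch('https://apify.com')))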
Example Actor
The following example shows a simple Actor that recursively scrapes the page title and headings (h1 to h3) from linked pages, up to a user-defined maximum depth. It uses Impit to fetch pages and Parsel to extract the data and discover new links.
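The Actor reads two input fields: start_urls, a list of objects with a url key, and max_depth, which limits how deep the crawl goes. If no input is provided, the code falls back to these defaults, shown here as the dictionary that Actor.get_input() would return:

{
    'start_urls': [{'url': 'https://apify.com'}],
    'max_depth': 1,
}

The complete Actor source follows.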
import asyncio
from urllib.parse import urljoin

import impit
import parsel

from apify import Actor, Request


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs with an initial crawl depth of 0.
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            new_request = Request.from_url(url, user_data={'depth': 0})
            await request_queue.add_request(new_request)

        # Create an Impit client to fetch the HTML content of the URLs.
        async with impit.AsyncClient() as client:
            # Process the URLs from the request queue.
            while request := await request_queue.fetch_next_request():
                url = request.url

                if not isinstance(request.user_data['depth'], (str, int)):
                    raise TypeError('Request.depth is an unexpected type.')

                depth = int(request.user_data['depth'])
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fetch the HTTP response from the specified URL using Impit.
                    response = await client.get(url)

                    # Parse the HTML content using Parsel Selector.
                    selector = parsel.Selector(text=response.text)

                    # If the current depth is less than max_depth, find nested links
                    # and enqueue them.
                    if depth < max_depth:
                        # Extract all links using CSS selector.
                        links = selector.css('a::attr(href)').getall()
                        for link_href in links:
                            link_url = urljoin(url, link_href)

                            if link_url.startswith(('http://', 'https://')):
                                Actor.log.info(f'Enqueuing {link_url} ...')
                                new_request = Request.from_url(
                                    link_url,
                                    user_data={'depth': depth + 1},
                                )
                                await request_queue.add_request(new_request)

                    # Extract the desired data using Parsel selectors.
                    title = selector.css('title::text').get()
                    h1s = selector.css('h1::text').getall()
                    h2s = selector.css('h2::text').getall()
                    h3s = selector.css('h3::text').getall()
                    data = {
                        'url': url,
                        'title': title,
                        'h1s': h1s,
                        'h2s': h2s,
                        'h3s': h3s,
                    }

                    # Store the extracted data to the default dataset.
                    await Actor.push_data(data)
                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')
                finally:
                    # Mark the request as handled to ensure it is not processed again.
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
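Because Parsel supports XPath as well as CSS selectors, the extraction step above could equally be written with XPath expressions, for example:

# Equivalent extraction using XPath instead of CSS selectors.
title = selector.xpath('//title/text()').get()
h1s = selector.xpath('//h1/text()').getall()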
Conclusion
In this guide, you learned how to use Parsel with Impit in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: Parsel provides excellent CSS selector and XPath support for data extraction, while Impit offers a fast and simple HTTP client built by Apify. Together they make it easy to build scalable scrapers in Python. See the Actor templates to get started with your own projects. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!