Scalable Web Scraping with Serverless - Part 1

April 27, 2024 · Lev Gelfenbuim · 12 min. read

Today I'll show you how to build a scalable infrastructure for web scraping using the Serverless Framework.

Serverless computing offers a compelling solution by eliminating the need to manage infrastructure, allowing developers to focus solely on code. This model is perfect for web scraping tasks that vary in intensity and frequency, as it scales automatically to meet demand without any manual intervention. Moreover, serverless functions are cost-effective, as you only pay for the compute time you consume, making it an ideal choice for both small-scale projects and large-scale data extraction tasks.

In this blog series, we will explore how to leverage serverless technologies to build a robust and scalable web scraping infrastructure. We'll use a suite of AWS services including Lambda, S3, SQS, and RDS, combined with popular Node.js libraries like node-fetch for fetching data, cheerio for parsing HTML, and node-postgres to interact with databases.

Our journey will be split into two parts:

  1. Part 1: We'll set up our serverless environment, deploy functions to fetch and store web data, and orchestrate our components using AWS services.
  2. Part 2: We will transform our raw HTML data into structured JSON and seamlessly load this data into a PostgreSQL database for further analysis.
Understanding Serverless Architecture
What is Serverless Computing?

Serverless computing is a cloud computing execution model where the cloud provider manages the setup, capacity planning, and server management for you. Essentially, it allows developers to write and deploy code without worrying about the underlying infrastructure. In a serverless setup, you only pay for the compute time you consume—there is no charge when your code is not running.

Key Benefits of Serverless for Web Scraping
  • Automatic Scaling: Serverless functions automatically scale based on the demand. For web scraping, this means you can handle virtually any number of pages or sites without needing to manually adjust the capacity of servers.
  • Cost-Effectiveness: With serverless, you pay per execution. You aren’t charged for idle resources. This can be incredibly cost-efficient compared to running dedicated servers 24/7, especially for scraping tasks that might only need to run a few times a day or week.
  • Reduced Management Overhead: The serverless model offloads much of the operational burden to the cloud provider. Maintenance tasks like server provisioning, patching, and administration are handled by the provider, freeing you to focus on improving your scraping logic and handling data.
How Serverless Fits Into Web Scraping

Serverless is particularly adept at handling event-driven applications—like those triggered by a scheduled time to scrape data or a new item appearing in a queue. This fits naturally with the often sporadic and on-demand nature of web scraping, where workloads can be highly variable and unpredictable.
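To make this event-driven model concrete, here is a minimal sketch (handler names are hypothetical) of how a single codebase can expose one handler fired on a schedule and another fired per queue message; note how the event payload differs by trigger:

// Hypothetical handlers illustrating the two trigger styles used later in this series.

// Fired by a schedule (e.g. a cron rule): the event describes the rule, not a page,
// so the handler decides what work to kick off.
module.exports.onSchedule = async (event) => {
  console.log('Scheduled run started at', new Date().toISOString());
  // ...enqueue scraping work here...
};

// Fired per SQS message: the event carries a Records array, one entry per message.
module.exports.onQueueMessage = async (event) => {
  for (const record of event.Records) {
    const task = JSON.parse(record.body);
    console.log('Processing task for URL:', task.url);
    // ...fetch and store the page here...
  }
};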

Popular Serverless Platforms

While AWS Lambda is one of the most popular serverless computing services, other cloud providers offer similar functionalities:

  • Azure Functions: Microsoft's equivalent to AWS Lambda, offering easy integration with other Azure services.
  • Google Cloud Functions: A lightweight, event-driven computing solution that can automatically scale based on the workload.
  • IBM Cloud Functions: Based on Apache OpenWhisk, IBM’s offering also supports serverless computing within their cloud ecosystem.

Each platform has its own strengths and pricing models, but AWS Lambda will be our focus due to its maturity, extensive documentation, and seamless integration with other AWS services like S3 and SQS.

Configuring the Serverless Framework

After gaining a foundational understanding of serverless architecture, our next step involves setting up the Serverless Framework. This powerful framework simplifies deploying serverless applications and managing their lifecycle. In this section, we'll cover how to configure the Serverless Framework to work seamlessly with AWS services, a crucial component for our scalable web scraping infrastructure.

Introduction to the Serverless Framework

The Serverless Framework is an open-source CLI that provides developers with a streamlined workflow for building and deploying serverless applications. It abstracts much of the complexity associated with configuring various cloud services, making it easier to launch applications across different cloud providers like AWS, Azure, and Google Cloud.

Installing the Serverless Framework

To begin, you need Node.js installed on your machine, since the Serverless Framework runs on it. Once Node.js is set up, you can install the Serverless Framework globally using npm:

npm install -g serverless

Or, if you prefer using Yarn:

yarn global add serverless
Configuring AWS Credentials

To deploy functions to AWS Lambda and manage resources, you need to configure your AWS credentials in the Serverless Framework:

  1. Create an AWS IAM User: Log in to your AWS Management Console and navigate to the IAM service. Create a new user with programmatic access. Attach the AdministratorAccess policy to this user for now, which grants the necessary permissions. For production, you should customize the permissions to follow the principle of least privilege.
  2. Configure Credentials: After creating your IAM user, you will receive an access key ID and secret access key. Use these credentials to configure the Serverless Framework by running the following command:
serverless config credentials --provider aws --key YOUR_ACCESS_KEY_ID --secret YOUR_SECRET_ACCESS_KEY

This command stores your credentials in a local file, which the Serverless Framework uses to interact with your AWS account.

Setting Up serverless.yml

Every Serverless Framework project contains a serverless.yml file at its root. This file is crucial as it defines the service configuration, including functions, events, and resources. Here’s a basic setup for our web scraping project:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours

In this configuration:

  • service: Defines the name of your serverless application.
  • provider: Specifies AWS as the provider and sets the runtime environment and region.
  • functions: Lists the functions to deploy. In this example, scrapeSite is triggered by a scheduled event; a minimal matching handler is sketched below.
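For this configuration to deploy, handler.scrape must resolve to an exported function. A minimal placeholder in handler.js (assuming it sits at the project root) is enough for a first deploy; we will flesh it out shortly:

// handler.js — minimal placeholder so `handler.scrape` resolves; the real logic comes later.
exports.scrape = async () => {
  console.log('scrapeSite invoked');
};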
Extending the Serverless Configuration for SQS

To fully harness the power of serverless architecture for our web scraping infrastructure, integrating Amazon Simple Queue Service (SQS) is essential. SQS will manage the messages related to tasks such as notifying our system when new data is available for processing. Here's how to extend our existing serverless.yml configuration to include SQS resources:

Adding SQS to the Serverless Framework Configuration

SQS queues can be defined and managed directly within the serverless.yml file, which allows for seamless integration and management within our AWS ecosystem. We will eventually use two queues—one for HTML files and another for JSON data—but for now let's add the HTML queue to our setup:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - sqs:SendMessage
        - sqs:ReceiveMessage
        - sqs:DeleteMessage
        - sqs:GetQueueAttributes
      Resource: "*"

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours
    environment:
      HTML_QUEUE_URL: !Ref HtmlQueue

Explanation of the Configuration:

  • IAM Role Statements: Specifies the permissions for the Lambda functions to interact with SQS, allowing them to send, receive, and delete messages.
  • Resources: Defines the SQS queue. The HtmlQueue is used to store messages that contain references to HTML files stored in S3.
  • Functions:
    • scrapeSite: This function is triggered on a schedule to scrape websites and push references to the HtmlQueue.
Creating Your First Scraping Function with AWS Lambda

With the Serverless Framework configured and ready, the next step in building our scalable web scraping infrastructure is to develop the first AWS Lambda function. This function will be responsible for fetching web data periodically based on the sitemap of a target website. Let's dive into creating and deploying this essential component.

Understanding AWS Lambda

AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the compute resources for you. Lambda is ideal for handling tasks that respond to HTTP requests, process queue messages, or, as in our case, execute tasks on a schedule.

Setting Up the Lambda Function
  1. Initialize a Serverless Project: Begin by creating a new directory for your project and navigate into it. Use the Serverless CLI to create a new service:
serverless create --template aws-nodejs --path my-web-scraper
cd my-web-scraper

This command sets up a basic Node.js project with a serverless.yml file and a sample handler function in handler.js.
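The exact contents vary by framework version, but the generated handler.js typically exports a single async stub roughly like the following, which you can rename or replace:

// Sample handler created by the aws-nodejs template (roughly; details differ by version).
'use strict';

module.exports.hello = async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify({ message: 'Your function executed successfully!', input: event }),
  };
};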

  2. Install Dependencies: For fetching and parsing the sitemap, we'll use node-fetch for HTTP requests and xml2js for converting XML data into a JavaScript object.
npm install node-fetch xml2js
Writing the Lambda Function

Create a new function in the handler.js file or whatever your main file is named. This function will handle the fetching of the sitemap and parsing it to get URLs:

const fetch = require('node-fetch');
const xml2js = require('xml2js');
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.scrape = async () => {
  try {
    const sitemapUrl = 'https://example.com/sitemap.xml';
    const response = await fetch(sitemapUrl);
    if (!response.ok) {
      throw new Error(`Failed to fetch sitemap: ${response.statusText}`);
    }
    const sitemapXml = await response.text();

    const parsedSitemap = await xml2js.parseStringPromise(sitemapXml);
    const urls = parsedSitemap.urlset.url.map(u => u.loc[0]);

    // Push each URL onto the SQS queue for downstream processing
    for (const url of urls) {
      const params = {
        MessageBody: JSON.stringify({ url }),
        QueueUrl: process.env.HTML_QUEUE_URL,
      };

      await sqs.sendMessage(params).promise();
    }
  } catch (error) {
    console.error('Error during scraping process:', error);
    // Optionally rethrow or handle error specifically (e.g., retry logic, notification)
  }
};
Key Components of the Function
  • Fetching the Sitemap: The function starts by fetching the sitemap.xml using node-fetch, a lightweight module suited for such tasks.
  • Parsing XML: The sitemap XML is parsed into a JavaScript object using xml2js, which allows easy access to the URLs listed in the sitemap (see the sketch after this list).
  • Pushing URLs to SQS: Each URL is then formatted into a message and sent to the SQS queue designated for HTML page processing.
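To see why the code reads parsedSitemap.urlset.url and u.loc[0], it helps to look at how xml2js represents a sitemap. With default options it wraps repeated and nested elements in arrays, so a small sitemap parses roughly like this (URLs are illustrative):

// Input XML (abridged):
// <urlset>
//   <url><loc>https://example.com/page-1</loc></url>
//   <url><loc>https://example.com/page-2</loc></url>
// </urlset>
//
// xml2js.parseStringPromise(...) with default options yields roughly:
// {
//   urlset: {
//     url: [
//       { loc: ['https://example.com/page-1'] },
//       { loc: ['https://example.com/page-2'] },
//     ],
//   },
// }
// Hence parsedSitemap.urlset.url is an array of entries and u.loc[0] is the URL string.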
Error Handling

Error handling is a must for a reliable Lambda function, especially in a web scraping context where external factors such as network issues or site changes can cause failures at any time. Here's how you can implement comprehensive error handling in your serverless function:

Error Monitoring and Logging

Integrate error monitoring and logging to capture failures for later analysis. Lambda automatically ships everything your handler writes to the console to AWS CloudWatch Logs, which helps in diagnosing issues after deployment. If you later expose functions through API Gateway, you can additionally enable execution logging at the provider level:

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    # IAM permissions here
  logs:
    restApi:
      loggingLevel: ERROR
      fullExecutionData: true
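Since console output lands in the function's CloudWatch log group automatically, emitting structured entries makes those logs much easier to filter later. A small sketch (the field names are just a suggestion):

// A tiny helper for structured error logs that are easy to query in CloudWatch Logs Insights.
const logError = (message, context = {}) => {
  console.error(JSON.stringify({ level: 'error', message, ...context, timestamp: new Date().toISOString() }));
};

// Usage inside a handler:
// logError('Failed to fetch sitemap', { url: sitemapUrl, status: response.status });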
Dead Letter Queues (DLQ)

Configure Dead Letter Queues (DLQ) in SQS for messages that cannot be processed after several attempts. This approach helps in isolating problematic messages and prevents them from clogging your processing pipeline.

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt HtmlDeadLetterQueue.Arn
          maxReceiveCount: 3  # Adjust based on your needs

    HtmlDeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlDeadLetterQueue
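Once messages start landing in the dead letter queue, you will want a quick way to inspect them. A minimal sketch using the AWS SDK (the queue URL is whatever AWS assigns to htmlDeadLetterQueue in your account):

// Peek at messages stuck in the DLQ without deleting them.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });

const inspectDlq = async (queueUrl) => {
  const { Messages = [] } = await sqs
    .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 10, VisibilityTimeout: 0 })
    .promise();
  for (const message of Messages) {
    console.log('Failed task:', message.Body);
  }
};

// inspectDlq('https://sqs.us-east-1.amazonaws.com/<account-id>/htmlDeadLetterQueue');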
Timeout and Memory Management

Properly manage timeouts and memory settings in your Lambda function to prevent unexpected terminations or performance issues.

functions:
  scrapeSite:
    handler: handler.scrape
    timeout: 30  # seconds
    memorySize: 256  # MB
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours
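Inside the handler you can also check how much of the configured timeout remains via the Lambda context object and stop taking on new work before the function is killed mid-task. A rough sketch (getUrlsToScrape and enqueueUrl are hypothetical helpers):

// Stop picking up new URLs when the remaining execution time gets low.
exports.scrape = async (event, context) => {
  const urls = await getUrlsToScrape(); // hypothetical helper
  for (const url of urls) {
    if (context.getRemainingTimeInMillis() < 5000) {
      console.warn('Approaching timeout, stopping early; remaining URLs will be handled on the next run');
      break;
    }
    await enqueueUrl(url); // hypothetical helper
  }
};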
Testing and Deployment

Before deploying your function, test it locally or in a development environment to ensure it behaves as expected. The Serverless Framework provides commands for invoking functions locally and deploying them to AWS.

  • Local Testing
serverless invoke local --function scrapeSite
  • Deployment
serverless deploy
Storing Web Data in AWS S3

After successfully setting up and deploying our Lambda function to scrape web data, the next crucial step in our serverless web scraping pipeline involves efficiently storing the retrieved data. Amazon S3 (Simple Storage Service) offers a robust solution for this purpose, providing scalability, data availability, security, and performance. This section will guide you through setting up S3 buckets and configuring your Lambda function to save scraped data directly to S3.

Why Choose Amazon S3?

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Some key benefits include:

  • Durability and Availability: S3 is designed for 99.999999999% (eleven nines) of object durability and offers high availability, so scraped pages remain safe and retrievable even in the face of hardware failures.
  • Scalability: Automatically scale your storage without worrying about the underlying infrastructure. This is crucial for web scraping applications where the amount of data can grow unpredictably.
  • Cost-Effectiveness: With S3, you pay only for the storage you use. There are no minimum fees or setup costs, which makes it a cost-effective solution for storing web scraped data.
Setting Up S3 Buckets

Before you can store data, you need to create an S3 bucket in your AWS account. Each bucket's name must be unique across all existing bucket names in Amazon S3 (bucket names are shared among all users globally).

  1. Create a Bucket:
    • Go to the AWS Management Console.
    • Navigate to S3 and select “Create bucket”.
    • Provide a unique bucket name, e.g., web-scraping-data-yourname.
    • Select the AWS region where you want the bucket to reside.
    • Leave the default settings or configure options like versioning or logging based on your specific requirements.
  2. Bucket Policy:
    • To define the bucket policy directly in your serverless.yml configuration, you can use AWS CloudFormation resources that the Serverless Framework supports. Here's how to include a bucket policy that specifically allows a Lambda function to put and get objects in an S3 bucket:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:ListBucket"
      Resource: "arn:aws:s3:::web-scraping-data-yourname"
    - Effect: "Allow"
      Action:
        - "s3:PutObject"
        - "s3:GetObject"
      Resource: "arn:aws:s3:::web-scraping-data-yourname/*"

resources:
  Resources:
    WebScrapingDataBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: web-scraping-data-yourname

    BucketPolicy:
      Type: AWS::S3::BucketPolicy
      Properties:
        Bucket:
          Ref: WebScrapingDataBucket
        PolicyDocument:
          Statement:
            - Effect: "Allow"
              Principal:
                AWS:
                  # IamRoleLambdaExecution is the logical ID of the execution role
                  # the Serverless Framework generates for the service's functions.
                  Fn::GetAtt: [IamRoleLambdaExecution, Arn]
              Action:
                - "s3:PutObject"
                - "s3:GetObject"
              Resource:
                Fn::Join:
                  - ""
                  - - "arn:aws:s3:::"
                    - Ref: WebScrapingDataBucket
                    - "/*"

Explanation:

  • IAM Role Statements: Defines permissions for the Lambda function to interact with S3 at the provider level. This allows listing the bucket contents and reading/writing objects.
  • WebScrapingDataBucket: This resource block creates the S3 bucket where the data will be stored.
  • BucketPolicy: This resource applies a bucket policy to the S3 bucket. It specifically allows the Lambda function to put and get objects. The Principal section utilizes the Fn::GetAtt CloudFormation function to dynamically fetch the Amazon Resource Name (ARN) of the Lambda function's execution role.
  • Resource: The policy applies to all objects within the bucket (/*), ensuring that the Lambda function can interact with any object stored therein. A quick way to sanity-check this access is sketched below.
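One way to confirm these permissions work end to end is to write and read back a small test object from inside a deployed function. A sketch, assuming the bucket name used above:

// Sanity-check S3 access from a deployed Lambda: write a test object, then read it back.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const bucket = 'web-scraping-data-yourname';

exports.checkS3Access = async () => {
  await s3.putObject({ Bucket: bucket, Key: 'healthcheck.txt', Body: 'ok' }).promise();
  const object = await s3.getObject({ Bucket: bucket, Key: 'healthcheck.txt' }).promise();
  console.log('S3 round trip succeeded:', object.Body.toString());
};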
Writing the Scraping Lambda Function
Updating the serverless.yml Configuration
  1. Define the Scraping Function: Add the new Lambda function that will be triggered by messages from HtmlQueue.
  2. Permissions: Ensure the function has the necessary permissions to access SQS and S3.
  3. Event Source: Configure the function to be triggered by new messages in HtmlQueue.

Here is how you could configure it in the serverless.yml:

functions:
  fetchHtmlContent:
    handler: fetchHtml.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - HtmlQueue
              - Arn
    environment:
      S3_BUCKET: web-scraping-data-yourname

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
Implementing the Scraping Lambda Function

The function will use node-fetch, routed through a rotating proxy via https-proxy-agent, to fetch the web content and upload it to S3.

const AWS = require('aws-sdk');
const fetch = require('node-fetch');
// Requires the https-proxy-agent package (npm install https-proxy-agent);
// version 7+ exposes the class as a named export.
const { HttpsProxyAgent } = require('https-proxy-agent');

const s3 = new AWS.S3();
const proxyList = ['http://proxy1.com:port', 'http://proxy2.com:port']; // Example proxy list

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { url } = JSON.parse(record.body);
    // Pick a proxy at random for naive rotation
    const proxy = proxyList[Math.floor(Math.random() * proxyList.length)];
    const proxyOptions = {
      headers: {
        'Proxy-Authorization': 'Basic ' + Buffer.from('user:password').toString('base64'), // if authentication is needed
      }
    };

    try {
      const response = await fetch(url, { agent: new HttpsProxyAgent(proxy), ...proxyOptions });
      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }
      const body = await response.text();

      const params = {
        Bucket: process.env.S3_BUCKET,
        // Note: a timestamp-only key loses the source URL; consider deriving the key from the URL instead.
        Key: `${new Date().getTime()}.html`,
        Body: body,
        ContentType: 'text/html',
      };

      await s3.upload(params).promise();
      console.log('HTML uploaded successfully:', params.Key);
    } catch (error) {
      console.error('Error fetching and uploading HTML:', error);
    }
  }
};
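Because the handler only depends on the shape of the SQS event, you can exercise it locally before deploying by feeding it a hand-built record (the URL and file name here are placeholders):

// local-test.js — invoke the handler with a fake SQS event (run with `node local-test.js`).
// Note: actually uploading requires AWS credentials and the S3_BUCKET env var to be set.
const { handler } = require('./fetchHtml');

const fakeEvent = {
  Records: [
    { body: JSON.stringify({ url: 'https://example.com' }) },
  ],
};

handler(fakeEvent)
  .then(() => console.log('Done'))
  .catch((err) => console.error(err));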
Key Points to Consider
  • Proxy Rotation: The function uses a simple random selection for proxy rotation. For more sophisticated proxy management, consider using a dedicated proxy rotation service.
  • Error Handling: Robust error handling ensures that failures in fetching URLs or uploading to S3 are logged and managed appropriately.
  • Security: If you are using proxies that require authentication, ensure that credentials are securely stored, possibly using AWS Secrets Manager or environment variables encrypted using AWS KMS; a sketch of the Secrets Manager approach follows this list.
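For that last point, here is a minimal sketch of loading proxy credentials from AWS Secrets Manager at runtime instead of hard-coding them (the secret name and JSON shape are assumptions; store whatever suits your proxies):

// Fetch proxy credentials from AWS Secrets Manager instead of embedding them in code.
// The function's IAM role also needs secretsmanager:GetSecretValue on this secret.
const AWS = require('aws-sdk');
const secretsManager = new AWS.SecretsManager();

const getProxyCredentials = async () => {
  // 'web-scraper/proxy-credentials' is a hypothetical secret name.
  const secret = await secretsManager.getSecretValue({ SecretId: 'web-scraper/proxy-credentials' }).promise();
  return JSON.parse(secret.SecretString); // e.g. { user: '...', password: '...' }
};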
Conclusion

We've taken significant steps in establishing a robust and scalable serverless web scraping system using AWS technologies and the Serverless Framework. By carefully configuring and deploying our serverless functions, we are well on our way to creating a fully automated data extraction pipeline that leverages the cloud's power for efficiency and scalability.

Key Achievements
  1. Understanding Serverless Architecture: We started by exploring what serverless computing entails and how its characteristics—such as automatic scaling, cost-effectiveness, and reduced management overhead—make it an ideal choice for web scraping tasks.
  2. Setting Up the Serverless Framework: We installed and configured the Serverless Framework, which serves as the backbone for deploying and managing our AWS services. This setup streamlines the process of deploying code and managing infrastructure, allowing us to focus more on our application logic than on server maintenance.
  3. Creating and Configuring AWS Services: We have successfully set up essential AWS services, including Lambda for running our code, S3 for storing the scraped data, and SQS for managing the messages between different parts of our scraping process. This configuration not only ensures efficient data handling but also robustness through decoupling of components.
  4. Building the Initial Scraping Function: Our first Lambda function, designed to fetch sitemap.xml from target websites and parse it for URLs, has been implemented. This function is crucial as it populates our SQS queue with tasks for further processing, demonstrating the automated trigger-based nature of our serverless architecture.
  5. Introduction of a Function for HTML Fetching: We introduced a crucial Lambda function triggered by messages in the HtmlQueue. This function fetches HTML content using rotating proxies to mitigate the risk of IP bans and uploads the fetched HTML to S3. This addition enhances our pipeline by ensuring real-time data fetching and storage, showcasing the power of integrating multiple serverless services.
  6. Secure and Scalable Data Storage: By integrating Amazon S3, we've ensured that our data is stored securely and is easily accessible for further processing. This setup benefits from S3’s durability and scalability, which are vital for handling potentially large volumes of web data.
Looking Forward

In Part 2 of this series, we will delve deeper into transforming the HTML data into a more structured format (JSON), which can be used for various analytical purposes. We will implement additional Lambda functions to process the HTML data stored in S3, transform it into JSON, and finally store this structured data in a PostgreSQL database. This will complete our end-to-end data processing pipeline, showcasing how serverless technologies can be effectively utilized for complex data processing tasks in web scraping.

Article last update: April 27, 2024
