Scalable Web Scraping with Serverless - Part 1

April 27, 2024 · Lev Gelfenbuim · 12 min. read

Today I'll show you how to build a scalable infrastructure for web scraping using the Serverless Framework.

Serverless computing offers a compelling solution by eliminating the need to manage infrastructure, allowing developers to focus solely on code. This model is perfect for web scraping tasks that vary in intensity and frequency, as it scales automatically to meet demand without any manual intervention. Moreover, serverless functions are cost-effective, as you only pay for the compute time you consume, making it an ideal choice for both small-scale projects and large-scale data extraction tasks.

In this blog series, we will explore how to leverage serverless technologies to build a robust and scalable web scraping infrastructure. We'll use a suite of AWS services including Lambda, S3, SQS, and RDS, combined with popular Node.js libraries like node-fetch for fetching data, cheerio for parsing HTML, and node-postgres to interact with databases.

Our journey will be split into two parts:

  1. Part 1: We'll set up our serverless environment, deploy functions to fetch and store web data, and orchestrate our components using AWS services.
  2. Part 2: We will transform our raw HTML data into structured JSON and seamlessly load this data into a PostgreSQL database for further analysis.
Understanding Serverless Architecture
What is Serverless Computing?

Serverless computing is a cloud computing execution model where the cloud provider manages the setup, capacity planning, and server management for you. Essentially, it allows developers to write and deploy code without worrying about the underlying infrastructure. In a serverless setup, you only pay for the compute time you consume—there is no charge when your code is not running.

Key Benefits of Serverless for Web Scraping
  • Automatic Scaling: Serverless functions automatically scale based on the demand. For web scraping, this means you can handle virtually any number of pages or sites without needing to manually adjust the capacity of servers.
  • Cost-Effectiveness: With serverless, you pay per execution. You aren’t charged for idle resources. This can be incredibly cost-efficient compared to running dedicated servers 24/7, especially for scraping tasks that might only need to run a few times a day or week.
  • Reduced Management Overhead: The serverless model offloads much of the operational burden to the cloud provider. Maintenance tasks like server provisioning, patching, and administration are handled by the provider, freeing you to focus on improving your scraping logic and handling data.
How Serverless Fits Into Web Scraping

Serverless is particularly adept at handling event-driven applications—like those triggered by a scheduled time to scrape data or a new item appearing in a queue. This fits naturally with the often sporadic and on-demand nature of web scraping, where workloads can be highly variable and unpredictable.

Popular Serverless Platforms

While AWS Lambda is one of the most popular serverless computing services, other cloud providers offer similar functionalities:

  • Azure Functions: Microsoft's equivalent to AWS Lambda, offering easy integration with other Azure services.
  • Google Cloud Functions: A lightweight, event-driven computing solution that can automatically scale based on the workload.
  • IBM Cloud Functions: Based on Apache OpenWhisk, IBM’s offering also supports serverless computing within their cloud ecosystem.

Each platform has its own strengths and pricing models, but AWS Lambda will be our focus due to its maturity, extensive documentation, and seamless integration with other AWS services like S3 and SQS.

Configuring the Serverless Framework

After gaining a foundational understanding of serverless architecture, our next step involves setting up the Serverless Framework. This powerful framework simplifies deploying serverless applications and managing their lifecycle. In this section, we'll cover how to configure the Serverless Framework to work seamlessly with AWS services, a crucial component for our scalable web scraping infrastructure.

Introduction to the Serverless Framework

The Serverless Framework is an open-source CLI that provides developers with a streamlined workflow for building and deploying serverless applications. It abstracts much of the complexity associated with configuring various cloud services, making it easier to launch applications across different cloud providers like AWS, Azure, and Google Cloud.

Installing the Serverless Framework

To begin, you need to have Node.js installed on your machine as the Serverless Framework runs on it. Once Node.js is set up, you can install the Serverless Framework globally using npm:

npm install -g serverless

Or, if you prefer using Yarn:

yarn global add serverless
Configuring AWS Credentials

To deploy functions to AWS Lambda and manage resources, you need to configure your AWS credentials in the Serverless Framework:

  1. Create an AWS IAM User: Log in to your AWS Management Console and navigate to the IAM service. Create a new user with programmatic access. Attach the AdministratorAccess policy to this user for now, which grants the necessary permissions. For production, you should customize the permissions to follow the principle of least privilege.
  2. Configure Credentials: After creating your IAM user, you will receive an access key ID and secret access key. Use these credentials to configure the Serverless Framework by running the following command:
serverless config credentials --provider aws --key YOUR_ACCESS_KEY_ID --secret YOUR_SECRET_ACCESS_KEY

This command stores your credentials in a local file, which the Serverless Framework uses to interact with your AWS account.
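The file in question is typically the standard AWS credentials file at ~/.aws/credentials. After running the command it should contain roughly the following, with your real keys in place of the placeholders:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY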

Setting Up serverless.yml

Every Serverless Framework project contains a serverless.yml file at its root. This file is crucial as it defines the service configuration, including functions, events, and resources. Here’s a basic setup for our web scraping project:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours

In this configuration:

  • service: Defines the name of your serverless application.
  • provider: Specifies AWS as the provider and sets the runtime environment and region.
  • functions: Lists the functions to deploy. In this example, scrapeSite is triggered by a scheduled event; a minimal matching handler stub is shown below.
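For reference, handler.scrape simply means "the scrape function exported from handler.js". A minimal stub that satisfies this configuration (the real scraping logic is added later in this post) could look like this:

// handler.js (placeholder; the full scraping logic comes later)
exports.scrape = async () => {
  console.log('Scheduled scrape triggered');
};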
Extending the Serverless Configuration for SQS

To fully harness the power of serverless architecture for our web scraping infrastructure, integrating Amazon Simple Queue Service (SQS) is essential. SQS will manage the messages related to tasks such as notifying our system when new data is available for processing. Here's how to extend our existing serverless.yml configuration to include SQS resources:

Adding SQS to the Serverless Framework Configuration

SQS queues can be defined and managed directly within the serverless.yml file, which allows for seamless integration and management within our AWS ecosystem. We will eventually use two queues, one for HTML files and another for JSON data, but for now let's define the first of them, HtmlQueue:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - sqs:SendMessage
        - sqs:ReceiveMessage
        - sqs:DeleteMessage
        - sqs:GetQueueAttributes
      Resource: "*"

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours
    environment:
      HTML_QUEUE_URL: !Ref HtmlQueue

Explanation of the Configuration:

  • IAM Role Statements: Specifies the permissions for the Lambda functions to interact with SQS, allowing them to send, receive, and delete messages.
  • Resources: Defines the SQS queue. The HtmlQueue holds one message per URL discovered in the sitemap; a downstream function will consume these messages, fetch each page, and store the HTML in S3.
  • Functions:
    • scrapeSite: This function is triggered on a schedule to read the target site's sitemap and push the discovered URLs to the HtmlQueue.
Creating Your First Scraping Function with AWS Lambda

With the Serverless Framework configured and ready, the next step in building our scalable web scraping infrastructure is to develop the first AWS Lambda function. This function will be responsible for fetching web data periodically based on the sitemap of a target website. Let's dive into creating and deploying this essential component.

Understanding AWS Lambda

AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the compute resources for you. Lambda is ideal for handling tasks that respond to HTTP requests, process queue messages, or, as in our case, execute tasks on a schedule.

Setting Up the Lambda Function
  1. Initialize a Serverless Project: Begin by creating a new directory for your project and navigate into it. Use the Serverless CLI to create a new service:
serverless create --template aws-nodejs --path my-web-scraper
cd my-web-scraper

This command sets up a basic Node.js project with a serverless.yml file and a sample handler function in handler.js.
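Assuming the aws-nodejs template, the generated project is intentionally small and should look roughly like this:

my-web-scraper/
├── .gitignore
├── handler.js      # sample handler with a single exported async function
└── serverless.yml  # service, provider, and function definitions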

  2. Install Dependencies: For fetching and parsing the sitemap, we'll use node-fetch for HTTP requests and xml2js for converting XML data into a JavaScript object. Install node-fetch at version 2, which still supports CommonJS require():

npm install node-fetch@2 xml2js
Writing the Lambda Function

Create the scrape function in handler.js (or whatever your main handler file is named). This function fetches the sitemap and parses it to extract the page URLs:

const AWS = require('aws-sdk');
const fetch = require('node-fetch');
const xml2js = require('xml2js');

const sqs = new AWS.SQS();

exports.scrape = async () => {
  try {
    const sitemapUrl = 'https://example.com/sitemap.xml';
    const response = await fetch(sitemapUrl);
    if (!response.ok) {
      throw new Error(`Failed to fetch sitemap: ${response.statusText}`);
    }
    const sitemapXml = await response.text();

    const parsedSitemap = await xml2js.parseStringPromise(sitemapXml);
    const urls = parsedSitemap.urlset.url.map(u => u.loc[0]);

    for (const url of urls) {
      const params = {
        MessageBody: JSON.stringify({ url }),
        QueueUrl: process.env.HTML_QUEUE_URL,
      };

      await sqs.sendMessage(params).promise();
    }
  } catch (error) {
    console.error('Error during scraping process:', error);
    // Optionally rethrow or handle the error specifically (e.g., retry logic, notification)
  }
};
Key Components of the Function
  • Fetching the Sitemap: The function starts by fetching the sitemap.xml using node-fetch, a lightweight module suited for such tasks.
  • Parsing XML: The sitemap XML is parsed into a JavaScript object using xml2js, which allows easy access to the URLs listed in the sitemap (see the example below).
  • Pushing URLs to SQS: Each URL is then formatted into a message and sent to the SQS queue designated for HTML page processing.
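To make the u.loc[0] access concrete, here is an illustrative sitemap fragment (not the target site's real sitemap) and the shape xml2js produces from it:

// A minimal sitemap.xml (illustrative):
//
//   <?xml version="1.0" encoding="UTF-8"?>
//   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
//     <url><loc>https://example.com/page-1</loc></url>
//     <url><loc>https://example.com/page-2</loc></url>
//   </urlset>
//
// xml2js turns repeated elements into arrays by default, so parseStringPromise() returns roughly:
//
//   { urlset: { url: [ { loc: ['https://example.com/page-1'] },
//                      { loc: ['https://example.com/page-2'] } ] } }
//
// which is why the handler maps over parsedSitemap.urlset.url and reads u.loc[0] for each entry.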
Error Handling

To enhance the reliability of your Lambda function, error handling is a must, especially in a web scraping context where external factors such as network issues or site changes can cause failures. Here is how you can build comprehensive error handling into your serverless setup:

Error Monitoring and Logging

Lambda automatically streams everything your code writes to the console into Amazon CloudWatch Logs, so errors captured with console.error are available for diagnosis after deployment. It is also worth controlling how long those logs are retained, which you can set at the provider level:

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    # IAM permissions here
  logRetentionInDays: 14  # Keep CloudWatch log groups for two weeks
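Because CloudWatch captures raw console output, a simple convention that makes errors easier to query later (for example with CloudWatch Logs Insights) is to log them as a single JSON object. A minimal sketch, assuming you adopt such a helper in the handlers:

// Log errors as structured JSON so they can be filtered and aggregated in CloudWatch
const logError = (message, context = {}) => {
  console.error(JSON.stringify({
    level: 'error',
    message,
    ...context,
    timestamp: new Date().toISOString(),
  }));
};

// Usage inside the catch block of the scrape function:
//   logError('Failed to fetch sitemap', { sitemapUrl, error: error.message });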
Dead Letter Queues (DLQ)

Configure Dead Letter Queues (DLQ) in SQS for messages that cannot be processed after several attempts. This approach helps in isolating problematic messages and prevents them from clogging your processing pipeline.

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt HtmlDeadLetterQueue.Arn
          maxReceiveCount: 3  # Adjust based on your needs

    HtmlDeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlDeadLetterQueue
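Messages that end up in the dead letter queue can be pulled down and inspected (and, after fixing the underlying issue, re-queued) with a small script. A minimal sketch using the AWS SDK v2, assuming the queue URL is provided in a DLQ_URL environment variable:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Fetch up to 10 messages from the dead letter queue for inspection
const inspectDlq = async () => {
  const { Messages = [] } = await sqs.receiveMessage({
    QueueUrl: process.env.DLQ_URL,   // assumed to hold the htmlDeadLetterQueue URL
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 5,              // long polling
  }).promise();

  for (const message of Messages) {
    console.log('Failed message body:', message.Body);
    // After diagnosing, you might re-send the message to the main queue or delete it here
  }
};

inspectDlq().catch(console.error);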
Timeout and Memory Management

Properly manage timeouts and memory settings in your Lambda function to prevent unexpected terminations or performance issues.

functions:
  scrapeSite:
    handler: handler.scrape
    timeout: 30  # seconds
    memorySize: 256  # MB
    events:
      - schedule:
          rate: cron(0 */4 * * ? *)  # Runs every 4 hours
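Beyond static settings, the Lambda context object reports how much execution time remains, which lets a function stop cleanly instead of being cut off mid-batch. A minimal sketch of that pattern (the loop body stands in for whatever per-URL work your function does):

exports.scrape = async (event, context) => {
  const urls = ['https://example.com/a', 'https://example.com/b']; // e.g. URLs parsed from the sitemap

  for (const url of urls) {
    // Leave a safety margin (5 seconds) so the function can log and exit before the timeout hits
    if (context.getRemainingTimeInMillis() < 5000) {
      console.warn('Approaching timeout, stopping early; remaining URLs will be handled on the next run');
      break;
    }
    // ...fetch the URL or enqueue a message here...
  }
};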
Testing and Deployment

Before deploying your function, test it locally or in a development environment to ensure it behaves as expected. The Serverless Framework provides commands for invoking functions locally and deploying them to AWS.

  • Local Testing:
serverless invoke local --function scrapeSite
  • Deployment:
serverless deploy
Storing Web Data in AWS S3

After successfully setting up and deploying our Lambda function to scrape web data, the next crucial step in our serverless web scraping pipeline involves efficiently storing the retrieved data. Amazon S3 (Simple Storage Service) offers a robust solution for this purpose, providing scalability, data availability, security, and performance. This section will guide you through setting up S3 buckets and configuring your Lambda function to save scraped data directly to S3.

Why Choose Amazon S3?

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Some key benefits include:

  • Durability and Availability: S3 is designed for 99.999999999% (11 nines) of object durability and offers high availability, so scraped pages remain safe and accessible without any extra operational effort on your part.
  • Scalability: Automatically scale your storage without worrying about the underlying infrastructure. This is crucial for web scraping applications where the amount of data can grow unpredictably.
  • Cost-Effectiveness: With S3, you pay only for the storage you use. There are no minimum fees or setup costs, which makes it a cost-effective solution for storing web scraped data.
Setting Up S3 Buckets

Before you can store data, you need to create an S3 bucket in your AWS account. Bucket names must be globally unique across all existing buckets in Amazon S3, regardless of account or region.

  1. Create a Bucket:
    • Go to the AWS Management Console.
    • Navigate to S3 and select “Create bucket”.
    • Provide a unique bucket name, e.g., web-scraping-data-yourname.
    • Select the AWS region where you want the bucket to reside.
    • Leave the default settings or configure options like versioning or logging based on your specific requirements.
  2. Bucket Policy:
    • To define the bucket policy directly in your serverless.yml configuration, you can use AWS CloudFormation resources that the Serverless Framework supports. Here's how to include a bucket policy that specifically allows a Lambda function to put and get objects in an S3 bucket:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:ListBucket"
      Resource: "arn:aws:s3:::web-scraping-data-yourname"
    - Effect: "Allow"
      Action:
        - "s3:PutObject"
        - "s3:GetObject"
      Resource: "arn:aws:s3:::web-scraping-data-yourname/*"

resources:
  Resources:
    WebScrapingDataBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: web-scraping-data-yourname

    BucketPolicy:
      Type: AWS::S3::BucketPolicy
      Properties:
        Bucket:
          Ref: WebScrapingDataBucket
        PolicyDocument:
          Statement:
            - Effect: "Allow"
              Principal:
                AWS:
                  Fn::GetAtt: [IamRoleLambdaExecution, Arn]  # execution role generated by the Serverless Framework
              Action:
                - "s3:PutObject"
                - "s3:GetObject"
              Resource:
                Fn::Join:
                  - ""
                  - - "arn:aws:s3:::"
                    - Ref: WebScrapingDataBucket
                    - "/*"

Explanation:

  • IAM Role Statements: Defines permissions for the Lambda function to interact with S3 at the provider level. This allows listing the bucket contents and reading/writing objects.
  • WebScrapingDataBucket: This resource block creates the S3 bucket where the data will be stored.
  • BucketPolicy: This resource applies a bucket policy to the S3 bucket, specifically allowing the Lambda functions to put and get objects. The Principal section uses the Fn::GetAtt CloudFormation function to dynamically fetch the Amazon Resource Name (ARN) of the functions' execution role, referenced by the logical ID IamRoleLambdaExecution that the Serverless Framework generates.
  • Resource: The policy applies to all objects within the bucket (/*), ensuring that the Lambda function can interact with any object stored therein.
Writing the Scraping Lambda Function
Updating the serverless.yml Configuration
  1. Define the Scraping Function: Add the new Lambda function that will be triggered by messages from HtmlQueue.
  2. Permissions: Ensure the function has the necessary permissions to access SQS and S3.
  3. Event Source: Configure the function to be triggered by new messages in HtmlQueue.

Here is how you could configure it in the serverless.yml:

functions:
  fetchHtmlContent:
    handler: fetchHtml.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - HtmlQueue
              - Arn
    environment:
      S3_BUCKET: web-scraping-data-yourname

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
Implementing the Scraping Lambda Function

The function will use node-fetch to fetch the web content, routing requests through a proxy via the https-proxy-agent package (install it with npm install https-proxy-agent):

const AWS = require('aws-sdk');
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent');

const s3 = new AWS.S3();
const proxyList = ['http://proxy1.com:port', 'http://proxy2.com:port']; // Example proxy list

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { url } = JSON.parse(record.body);
    const proxy = proxyList[Math.floor(Math.random() * proxyList.length)];
    const proxyOptions = {
      headers: {
        'Proxy-Authorization': 'Basic ' + Buffer.from('user:password').toString('base64'), // if authentication is needed
      },
    };

    try {
      const response = await fetch(url, { agent: new HttpsProxyAgent(proxy), ...proxyOptions });
      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }
      const body = await response.text();

      const params = {
        Bucket: process.env.S3_BUCKET,
        Key: `${new Date().getTime()}.html`,
        Body: body,
        ContentType: 'text/html',
      };

      await s3.upload(params).promise();
      console.log('HTML uploaded successfully:', params.Key);
    } catch (error) {
      console.error('Error fetching and uploading HTML:', error);
    }
  }
};
Key Points to Consider
  • Proxy Rotation: The function uses simple random selection for proxy rotation. For more sophisticated proxy management, consider a dedicated proxy rotation service.
  • Error Handling: Robust error handling ensures that failures in fetching URLs or uploading to S3 are logged and managed appropriately.
  • Security: If you are using proxies that require authentication, ensure that credentials are stored securely, for example in AWS Secrets Manager or in environment variables encrypted with AWS KMS; a minimal Secrets Manager sketch follows below.
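As one option for that last point, proxy credentials can be fetched from AWS Secrets Manager at runtime instead of being hard-coded. A minimal sketch using the AWS SDK v2, assuming a secret named web-scraper/proxy-credentials that stores a JSON string with user and password fields (both the secret name and its shape are illustrative assumptions):

const AWS = require('aws-sdk');
const secretsManager = new AWS.SecretsManager();

// Cache the secret across invocations of a warm Lambda container
let cachedProxyCredentials;

const getProxyCredentials = async () => {
  if (!cachedProxyCredentials) {
    const { SecretString } = await secretsManager.getSecretValue({
      SecretId: 'web-scraper/proxy-credentials',   // hypothetical secret name
    }).promise();
    cachedProxyCredentials = JSON.parse(SecretString); // e.g. { "user": "...", "password": "..." }
  }
  return cachedProxyCredentials;
};

// Usage inside the handler:
//   const { user, password } = await getProxyCredentials();
//   const auth = 'Basic ' + Buffer.from(`${user}:${password}`).toString('base64');

Keep in mind that the function's IAM role also needs permission to call secretsmanager:GetSecretValue on that secret.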
Conclusion

We've taken significant steps in establishing a robust and scalable serverless web scraping system using AWS technologies and the Serverless Framework. By carefully configuring and deploying our serverless functions, we are well on our way to creating a fully automated data extraction pipeline that leverages the cloud's power for efficiency and scalability.

Key Achievements
  1. Understanding Serverless Architecture: We started by exploring what serverless computing entails and how its characteristics—such as automatic scaling, cost-effectiveness, and reduced management overhead—make it an ideal choice for web scraping tasks.
  2. Setting Up the Serverless Framework: We installed and configured the Serverless Framework, which serves as the backbone for deploying and managing our AWS services. This setup streamlines the process of deploying code and managing infrastructure, allowing us to focus more on our application logic than on server maintenance.
  3. Creating and Configuring AWS Services: We have successfully set up essential AWS services, including Lambda for running our code, S3 for storing the scraped data, and SQS for managing the messages between different parts of our scraping process. This configuration not only ensures efficient data handling but also robustness through decoupling of components.
  4. Building the Initial Scraping Function: Our first Lambda function, designed to fetch sitemap.xml from target websites and parse it for URLs, has been implemented. This function is crucial as it populates our SQS queue with tasks for further processing, demonstrating the automated trigger-based nature of our serverless architecture.
  5. Introduction of a Function for HTML Fetching: We introduced a crucial Lambda function triggered by messages in the HtmlQueue. This function fetches HTML content using rotating proxies to mitigate the risk of IP bans and uploads the fetched HTML to S3. This addition enhances our pipeline by ensuring real-time data fetching and storage, showcasing the power of integrating multiple serverless services.
  6. Secure and Scalable Data Storage: By integrating Amazon S3, we've ensured that our data is stored securely and is easily accessible for further processing. This setup benefits from S3’s durability and scalability, which are vital for handling potentially large volumes of web data.
Looking Forward

In Part 2 of this series, we will delve deeper into transforming the HTML data into a more structured format (JSON), which can be used for various analytical purposes. We will implement additional Lambda functions to process the HTML data stored in S3, transform it into JSON, and finally store this structured data in a PostgreSQL database. This will complete our end-to-end data processing pipeline, showcasing how serverless technologies can be effectively utilized for complex data processing tasks in web scraping.

Article last update: May 13, 2024

Scraping
Serverless
Ethical Hacking

Frequently Asked Questions

What is serverless computing, and how does it benefit web scraping?
Serverless computing is a cloud execution model where the cloud provider manages infrastructure tasks like setup, capacity planning, and server management. This model benefits web scraping by automatically scaling to meet demand, being cost-effective by charging only for compute time consumed, and reducing management overhead.

Which AWS services does this scraping infrastructure use?
The AWS services used include Lambda for running serverless functions, S3 for storing scraped data, SQS for managing task-related messages, and RDS (specifically PostgreSQL) for database storage.

What does the cloud provider handle in a serverless architecture?
In a serverless architecture, the cloud provider takes care of server provisioning, patching, and maintenance tasks, allowing developers to focus solely on writing code and improving their application logic.

How do you set up the Serverless Framework?
Key steps include installing Node.js, installing the Serverless Framework globally using npm or yarn, configuring AWS credentials, and setting up the serverless.yml file to define service configuration, functions, events, and resources.

Why is Amazon S3 used to store the scraped data?
Amazon S3 is chosen for its durability, availability, scalability, and cost-effectiveness. It provides secure and reliable storage that automatically scales to meet data storage needs, making it ideal for unpredictable and potentially large volumes of web scraped data.

Which serverless platforms exist besides AWS Lambda?
Other popular serverless platforms mentioned include Microsoft Azure Functions, Google Cloud Functions, and IBM Cloud Functions.

What role does SQS play in the scraping pipeline?
SQS manages messages that notify the system when new data is available for processing, enabling efficient task distribution and decoupling the scraping and processing components for better scalability and reliability.

What is node-fetch used for?
The node-fetch library is used within the Lambda function to make HTTP requests to fetch data, such as retrieving the sitemap.xml from target websites.

How are permissions managed for the Lambda functions?
AWS IAM roles are configured to grant necessary permissions to Lambda functions for interacting with other AWS services, such as reading and writing to S3 or sending and receiving messages from SQS.

What measures improve the reliability of the scraping functions?
Recommended measures include integrating error monitoring and logging tools like AWS CloudWatch, configuring Dead Letter Queues (DLQ) for managing unprocessed messages, and properly managing function timeouts and memory settings to prevent unexpected terminations.
