Scalable Web Scraping with Serverless - Part 2

May 13, 2024 | Lev Gelfenbuim | 11 min. read

Today we will focus on processing the HTML data stored in AWS S3, transforming it into JSON format, and then loading this data into a PostgreSQL database. This will complete the pipeline by adding data processing and storage capabilities, making your system capable of handling end-to-end data operations in a scalable and efficient manner.


In the first part of our series on building a scalable serverless web scraping system, we successfully set up the foundational components. We created AWS Lambda functions to fetch sitemap.xml, extract URLs, and store HTML content in Amazon S3, orchestrated by SQS for seamless message handling. Having established a robust infrastructure for scraping and initial data storage, it's time to move forward with the next stages of our data pipeline.

In Part 2 of this series, we will focus on transforming the HTML content stored in S3 into a more structured JSON format. This transformation is crucial as it facilitates easier data manipulation and integration with various analytics tools and databases. We will then detail how to load this structured data into a PostgreSQL database hosted on Amazon RDS, effectively completing our end-to-end serverless web scraping solution.

Goals of Part 2
  • Transform HTML into JSON: We will implement a Lambda function to parse HTML files retrieved from S3, extract necessary information using the cheerio library, and convert this data into JSON. This structured format makes the data more accessible and usable for further processing and analysis.
  • Store Data in PostgreSQL: After transforming the data into JSON, we will load it into a PostgreSQL database. We will set up a relational database schema suitable for our data and use another Lambda function to handle the insertion process. This step is vital for persistent storage and efficient querying of scraped data.
  • Automation and Orchestration: By integrating SQS, we will automate the workflow from data scraping to storage. Each step of the process will trigger the next, ensuring a smooth data flow and reducing manual intervention.
  • Optimization and Monitoring: Lastly, we will discuss monitoring the performance of our serverless functions using AWS CloudWatch and optimize the system to ensure cost-efficiency and high performance.
Setting Up the HTML to JSON Transformation

The process of transforming the HTML to JSON is essential as it allows us to extract relevant data from the raw HTML and prepare it for analytical tasks or storage in a database. In this section, we'll cover the setup of a Lambda function that fetches HTML from S3, parses it using cheerio, and transforms it into JSON.

Implementation Steps
  • Fetching HTML from S3: The function starts by retrieving the HTML file from S3 using the key provided in the SQS message.
  • Parsing HTML with Cheerio: Utilize cheerio, a fast, flexible, and lean implementation of core jQuery designed specifically for the server, to parse the HTML content.
  • Extracting Relevant Data: Identify and extract the necessary pieces of information from the parsed HTML. This might include elements like headings, paragraphs, links, or other data embedded in the HTML structure.
  • Constructing JSON Objects: Assemble the extracted data into a JSON object. This structured format makes it easier to manipulate and store the data subsequently.

Sample Code for the Lambda Function

Here is a basic example of what the Lambda function’s code might look like:

const AWS = require('aws-sdk');
const cheerio = require('cheerio');

const s3 = new AWS.S3();

exports.handler = async (event) => {
  for (const record of event.Records) {
    // Each SQS message body carries the bucket and key of a stored HTML file
    const { bucketName, objectKey } = JSON.parse(record.body);

    // Fetch the HTML file from S3
    const htmlData = await s3.getObject({
      Bucket: bucketName,
      Key: objectKey
    }).promise();

    // Load HTML into Cheerio
    const $ = cheerio.load(htmlData.Body.toString('utf-8'));

    // Extract relevant data
    const pageData = {
      title: $('title').text(),
      headings: $('h1, h2, h3').map((i, el) => $(el).text()).get(),
      links: $('a').map((i, el) => $(el).attr('href')).get()
    };

    // Construct JSON object
    const jsonData = JSON.stringify(pageData);

    // Additional code to handle JSON data as needed
    console.log(jsonData);
  }
};
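
The href values collected above are raw attribute strings, so they can be relative paths, fragments, or non-web schemes such as mailto:. The following is a minimal sketch of one way to clean them up before storage, using only Node's built-in URL class. The function name normalizeLinks and the pageUrl argument are assumptions for illustration; the handler above does not currently receive the page URL, so it would have to be added to the SQS message or derived from the object key.

```javascript
// Resolve raw href values against the page URL and keep unique http(s) links.
// `pageUrl` is a hypothetical input: the URL the HTML was originally fetched from.
function normalizeLinks(hrefs, pageUrl) {
  const unique = new Set();
  for (const href of hrefs) {
    if (typeof href !== 'string' || href.trim() === '') continue;
    let url;
    try {
      // Relative hrefs are resolved against the page URL
      url = new URL(href.trim(), pageUrl);
    } catch (err) {
      continue; // Skip hrefs that cannot be parsed as URLs
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    url.hash = ''; // Drop fragments so "#section" variants collapse to one link
    unique.add(url.href);
  }
  return [...unique];
}
```

In the handler, this could be applied as links: normalizeLinks(rawLinks, pageUrl) once a page URL is available.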

To add the HTML-to-JSON Lambda function to your serverless.yml configuration, define the function, set its trigger, and grant it the permissions it needs to access AWS S3 and SQS. A sample configuration looks like this:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x  # Specify your Node.js version
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
      Resource:
        - arn:aws:s3:::your-html-bucket-name/*
        - arn:aws:s3:::your-json-bucket-name/*
    - Effect: Allow
      Action:
        - sqs:ReceiveMessage
        - sqs:DeleteMessage
        - sqs:GetQueueAttributes
      Resource:
        - !GetAtt HtmlQueue.Arn

functions:
  transformHtmlToJson:
    handler: transformHtmlToJson.handler  # Update with your actual handler file and function
    events:
      - sqs:
          arn: !GetAtt HtmlQueue.Arn
    environment:
      HTML_BUCKET_NAME: your-html-bucket-name
      JSON_BUCKET_NAME: your-json-bucket-name

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
    JsonQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: jsonQueue

Key Components of the Configuration

  • iamRoleStatements: This section defines the necessary permissions for the Lambda function. It needs to be able to read from the S3 bucket where HTML files are stored and optionally write to another S3 bucket where JSON files might be saved. It also needs permissions to interact with the SQS queue.
  • functions: Under this section, the transformHtmlToJson function is defined. It points to a handler, which should match a function in your code that processes the HTML and converts it to JSON. The function is triggered by new messages in HtmlQueue, as indicated in the events subsection.
  • environment: Here, environment variables are defined to make the S3 bucket names easily configurable and accessible within the Lambda function code.
  • resources: Declares the HtmlQueue and JsonQueue SQS queues as CloudFormation resources, setting their logical names and properties such as QueueName.
Storing the Transformed JSON
  • Once the HTML is transformed into JSON, you may choose to store it back in S3 for temporary holding or further processing.
  • Configure the function to upload the JSON files to a designated S3 bucket, ensuring that they are organized and accessible for the next steps in your data pipeline.
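
As a rough sketch of that step, the helper below derives the JSON object key from the HTML key and prepares both the S3 upload parameters and the message for the next queue. The function name buildJsonOutput and the key-naming scheme (same path, .json extension) are assumptions for illustration, not something the pipeline prescribes. The message body mirrors the { bucketName, objectKey } shape that the transformation handler reads.

```javascript
const path = require('path');

// Build the S3 upload parameters and the follow-up queue message for one page.
// Hypothetical naming scheme: "pages/about.html" becomes "pages/about.json".
function buildJsonOutput(htmlKey, pageData, jsonBucket) {
  const parsed = path.posix.parse(htmlKey);
  const jsonKey = path.posix.join(parsed.dir, `${parsed.name}.json`);
  return {
    putParams: {
      Bucket: jsonBucket,
      Key: jsonKey,
      Body: JSON.stringify(pageData),
      ContentType: 'application/json'
    },
    // Same message shape the HTML handler consumes, pointing at the JSON file
    queueMessage: JSON.stringify({ bucketName: jsonBucket, objectKey: jsonKey })
  };
}

// Inside the handler (AWS SDK v2), the result could be used roughly like this,
// where JSON_QUEUE_URL is an assumed environment variable:
//   const { putParams, queueMessage } = buildJsonOutput(objectKey, pageData, process.env.JSON_BUCKET_NAME);
//   await s3.putObject(putParams).promise();
//   await sqs.sendMessage({ QueueUrl: process.env.JSON_QUEUE_URL, MessageBody: queueMessage }).promise();
```

Note that sending to the next queue would also require the sqs:SendMessage permission on that queue in the function's IAM role.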
Implementing Data Loading to PostgreSQL

After transforming HTML content into structured JSON, the next step in our serverless web scraping infrastructure involves storing this data in a PostgreSQL database. Using Amazon RDS (Relational Database Service) to host PostgreSQL provides a scalable, secure, and managed database service that integrates seamlessly with AWS Lambda. This section will guide us through setting up a PostgreSQL instance, creating a Lambda function to load JSON data into the database, and ensuring smooth data flow and integrity.

Setting Up PostgreSQL on Amazon RDS
  1. Create a PostgreSQL Database Instance:
  • Navigate to the RDS section in your AWS Management Console.
  • Click "Create database" and select PostgreSQL as the database engine.
  • Configure the instance settings such as instance class, storage, and connectivity. Enable public accessibility if you need to connect from your local environment for testing.
  • Set the initial database name, master username, and password. Store these credentials securely, preferably using AWS Secrets Manager.
  2. Security and Network Configuration:
  • Ensure that the database instance is placed in a secure VPC.
  • Modify the VPC security groups to allow inbound traffic on the default PostgreSQL port (5432) from your Lambda function's security group.
[Figure: AWS RDS: New DB Creation]
Creating the Lambda Function for Database Insertion

For database insertions, we will define a new Lambda function in serverless.yml that will be triggered by new messages in the SQS queue, after the JSON transformation process.

Sample Code for the Lambda Function

Here is a basic example of what the Lambda function’s code might look like:

const { Pool } = require('pg');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// Configure PostgreSQL connection using environment variables
const pool = new Pool({
  user: process.env.PG_USER,
  host: process.env.PG_HOST,
  database: process.env.PG_DATABASE,
  password: process.env.PG_PASSWORD,
  port: process.env.PG_PORT,
});

exports.handler = async (event) => {
  for (const record of event.Records) {
    const objectKey = JSON.parse(record.body).objectKey;
    const bucketName = 'your-json-bucket-name';

    // Fetch JSON from S3
    const data = await s3.getObject({
      Bucket: bucketName,
      Key: objectKey
    }).promise();

    const jsonData = JSON.parse(data.Body.toString('utf-8'));

    // Insert data into PostgreSQL (data_field is assumed to be a json/jsonb column)
    const client = await pool.connect();
    try {
      await client.query('INSERT INTO your_table (data_field) VALUES ($1)', [jsonData]);
    } catch (err) {
      console.error('Error inserting data into PostgreSQL', err);
      throw err; // Re-throw so the message is retried instead of being deleted
    } finally {
      client.release();
    }
  }
};
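
Because SQS can deliver the same message more than once, a plain INSERT may create duplicate rows. The sketch below shows one possible way to make the write idempotent with PostgreSQL's ON CONFLICT clause. The table scraped_pages, its columns (source_key as a unique key, data as jsonb), and the function name buildUpsertQuery are all hypothetical; adapt them to your own schema.

```javascript
// Build a parameterized upsert for one scraped page.
// Assumes a hypothetical table:
//   CREATE TABLE scraped_pages (source_key TEXT PRIMARY KEY, data JSONB NOT NULL);
function buildUpsertQuery(objectKey, pageData) {
  return {
    text:
      'INSERT INTO scraped_pages (source_key, data) ' +
      'VALUES ($1, $2::jsonb) ' +
      'ON CONFLICT (source_key) DO UPDATE SET data = EXCLUDED.data',
    // Serialize explicitly so the jsonb cast always receives valid JSON text
    values: [objectKey, JSON.stringify(pageData)]
  };
}

// With node-postgres, the returned object can be passed directly to a client:
//   await client.query(buildUpsertQuery(objectKey, jsonData));
```

Keying the row on the S3 object key means that reprocessing the same file overwrites the existing row rather than adding a new one.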

Modifying the serverless.yml Configuration

First, define the environment variables in your serverless.yml file, either under the provider section or in the function-specific environment section:

provider:
  name: aws
  environment:
    PG_USER: ${ssm:/myapp/prod/db/username}
    PG_HOST: ${ssm:/myapp/prod/db/host}
    PG_DATABASE: ${ssm:/myapp/prod/db/database}
    PG_PASSWORD: ${ssm:/myapp/prod/db/password}
    PG_PORT: 5432
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "secretsmanager:GetSecretValue"
        - "ssm:GetParameter"
      Resource: "*"

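In this configuration, ${ssm:/myapp/prod/db/username} refers to a parameter stored in the AWS Systems Manager Parameter Store, which is a secure place to store configuration data and secrets. The Serverless Framework resolves these references at deployment time and sets the resulting values as the function's environment variables.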

Then, define the load function:

functions:
  loadDataToPostgreSQL:
    handler: handlers.loadDataToPostgreSQL
    events:
      - sqs:
          arn: !GetAtt JsonQueue.Arn
    environment:
      S3_BUCKET_NAME: your-json-bucket-name

Key Components of the Configuration:

  • environment: The PostgreSQL credentials are fetched from the AWS Systems Manager Parameter Store, ensuring they are managed securely and not hard-coded into your application.
  • iamRoleStatements: Permissions are set to allow the Lambda function to access S3, SQS, and securely fetch parameters from the Systems Manager or Secrets Manager.
  • functions: The function is configured to be triggered by messages arriving in the SQS queue (JsonQueue), which are expected to contain notifications when JSON data is ready to be processed and loaded into the database.
  • resources: The JsonQueue SQS queue is defined in the resources section to manage the flow of JSON data processing tasks.
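
Since the connection settings arrive as environment variables, a missing or mistyped parameter only shows up when the first query fails. As a minimal sketch, the helper below checks the PG_* variables up front and returns a configuration object for the pg Pool. The function name buildPoolConfig and the max: 1 setting are assumptions for illustration; a single connection is often enough because each Lambda execution environment handles one invocation at a time.

```javascript
const REQUIRED_VARS = ['PG_USER', 'PG_HOST', 'PG_DATABASE', 'PG_PASSWORD'];

// Validate the PG_* environment variables and build a pg Pool configuration.
function buildPoolConfig(env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  // Environment variables are always strings, so convert the port explicitly
  const port = Number(env.PG_PORT || 5432);
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`Invalid PG_PORT: ${env.PG_PORT}`);
  }
  return {
    user: env.PG_USER,
    host: env.PG_HOST,
    database: env.PG_DATABASE,
    password: env.PG_PASSWORD,
    port,
    max: 1 // Assumption: one connection per execution environment
  };
}

// Usage in the handler module:
//   const pool = new Pool(buildPoolConfig(process.env));
```

Failing fast at module load makes a misconfigured deployment visible in the first invocation's logs rather than as a connection timeout.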
Orchestrating the Workflow with SQS

In our serverless web scraping system, Amazon Simple Queue Service (SQS) plays a crucial role in orchestrating the workflow between different components. By adopting SQS, we can efficiently manage the communication and data flow between our Lambda functions, ensuring a seamless and scalable operation. In this section we will discuss how we utilize SQS to coordinate the data processing steps from scraping HTML to transforming it into JSON, and finally storing it in PostgreSQL.

Why Use SQS?

SQS offers a robust, fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Using SQS, you can send, store, and receive messages between software components without losing messages or requiring other services to be available. Here’s how SQS benefits our application:

  • Decoupling Components: SQS allows components of our application to operate independently, improving fault tolerance and scalability.
  • Handling Spikes in Workload: During peak times of data ingestion, SQS can handle increased volumes of messages, ensuring that our processing does not get overwhelmed.
  • Guaranteeing Message Delivery: Standard SQS queues provide at-least-once delivery with best-effort ordering. A message will not be lost, but it can occasionally arrive more than once or out of order, so each processing step should be safe to repeat.
Managing SQS Messages
  1. Visibility Timeout: Set the queue's visibility timeout based on the expected processing time of your Lambda functions, so that a message does not become visible again while it is still being processed. AWS recommends at least six times the function timeout.
  2. Error Handling: Attach a dead-letter queue (DLQ) to each queue to capture messages that still cannot be processed after several attempts.

The following example shows how to set up a Lambda function triggered by an SQS queue with a specific batch size, visibility timeout, and a dead letter queue for handling message processing failures:

service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x

functions:
  processHtml:
    handler: handlers.processHtml
    events:
      - sqs:
          arn: !GetAtt HtmlQueue.Arn
          batchSize: 10

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
        # Visibility timeout (seconds) and retry limit are properties of the queue
        VisibilityTimeout: 300
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt HtmlDeadLetterQueue.Arn
          maxReceiveCount: 3  # Move a message to the DLQ after 3 failed receives

    HtmlDeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlDeadLetterQueue
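
With a batch size of 10, a handler that throws causes the whole batch to be retried, including messages that were processed successfully. Lambda's partial batch response feature avoids this by letting the handler report only the failed message IDs. The following is a sketch under stated assumptions: the names handleBatch, buildBatchResponse, and processRecord are hypothetical, and the feature only takes effect when the SQS event is configured with functionResponseType: ReportBatchItemFailures.

```javascript
// Turn per-record outcomes (from Promise.allSettled) into the response shape
// Lambda expects for partial batch failures.
function buildBatchResponse(records, outcomes) {
  const batchItemFailures = [];
  outcomes.forEach((outcome, index) => {
    if (outcome.status === 'rejected') {
      batchItemFailures.push({ itemIdentifier: records[index].messageId });
    }
  });
  return { batchItemFailures };
}

// Process every record independently; only failed ones return to the queue.
// `processRecord` is a hypothetical function holding the per-message logic.
async function handleBatch(event, processRecord) {
  const outcomes = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections as well
    event.Records.map(async (record) => processRecord(record))
  );
  return buildBatchResponse(event.Records, outcomes);
}

// Example wiring for a Lambda handler:
//   exports.handler = (event) => handleBatch(event, processHtmlRecord);
```

Messages listed in batchItemFailures become visible again after the visibility timeout and count toward the queue's retry limit, so repeated failures still end up in the dead-letter queue.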
Monitoring and Adjusting
  1. Monitoring: Use AWS CloudWatch to monitor the number of messages sent and received, as well as the number of empty receives, which can indicate inefficiencies in your queue processing.
  2. Adjusting SQS Settings: Based on the monitoring insights, adjust settings like batch size, visibility timeout, and retry policies to optimize the performance and cost-effectiveness of your message processing.

The following example demonstrates how you can modify a Lambda function's code to send custom metrics to AWS CloudWatch using the AWS SDK for Node.js. This can be very useful for tracking custom operational metrics like function execution times, the number of processed items, errors, or other specific metrics that help in monitoring and optimizing your serverless application.

First, add the appropriate IAM Role statement to serverless.yml:

provider:
  name: aws
  runtime: nodejs14.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - cloudwatch:PutMetricData
      Resource: "*"
  environment:
    CLOUDWATCH_NAMESPACE: 'WebScrapingService'

Use putMetricData to send metrics to CloudWatch:

const AWS = require('aws-sdk');
// Instantiate CloudWatch client
const cloudwatch = new AWS.CloudWatch({ apiVersion: '2010-08-01' });

exports.handler = async (event) => {
    const startTime = new Date(); // Record start time for execution timing

    try {
        // Example process: Parse messages from SQS event
        const numMessages = event.Records.length;

        // Simulate some processing logic
        event.Records.forEach(record => {
            console.log('Processing record:', record.messageId);
            // Further processing here...
        });

        // Calculate processing time
        const endTime = new Date();
        const processingTime = endTime - startTime;

        // Send custom metric for processing time
        await cloudwatch.putMetricData({
            Namespace: 'WebScrapingService', // Custom namespace
            MetricData: [
                {
                    MetricName: 'ProcessingTime',
                    Dimensions: [
                        {
                            Name: 'FunctionName',
                            Value: 'processHtml'
                        }
                    ],
                    Unit: 'Milliseconds',
                    Value: processingTime
                },
                {
                    MetricName: 'ProcessedMessages',
                    Dimensions: [
                        {
                            Name: 'FunctionName',
                            Value: 'processHtml'
                        }
                    ],
                    Unit: 'Count',
                    Value: numMessages
                }
            ]
        }).promise();

        console.log('Metrics pushed to CloudWatch');
    } catch (error) {
        console.error('Error processing messages:', error);

        // Send error metric to CloudWatch
        await cloudwatch.putMetricData({
            Namespace: 'WebScrapingService',
            MetricData: [
                {
                    MetricName: 'Errors',
                    Dimensions: [
                        {
                            Name: 'FunctionName',
                            Value: 'processHtml'
                        }
                    ],
                    Unit: 'Count',
                    Value: 1
                }
            ]
        }).promise();

        throw error; // Re-throw error to be handled by Lambda or trigger DLQ
    }
};

Key Elements in the Code:

  • CloudWatch Client Initialization: The AWS SDK's CloudWatch client is initialized to facilitate sending metrics.
  • Tracking Execution Time: The code tracks how long the Lambda function takes to process its input, which is useful for performance monitoring.
  • Counting Messages: It also tracks how many messages (or records) were processed during the invocation.
  • Error Metrics: In case of errors, it sends a separate metric to CloudWatch to track the occurrence of errors.
  • Custom Metrics: The metrics include ProcessingTime and ProcessedMessages, which are sent to a custom namespace in CloudWatch named 'WebScrapingService'. This allows you to separate and filter these metrics easily from other default metrics provided by AWS.
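
The three metric definitions in the handler repeat the same dimension block. One possible refactoring, shown here as a sketch, is a small builder that produces the putMetricData parameters. The helper names buildMetric and buildMetricParams are hypothetical; the field names (Namespace, MetricData, MetricName, Dimensions, Unit, Value) are the ones used in the handler above.

```javascript
// Build one CloudWatch metric datum tagged with the function name.
function buildMetric(metricName, unit, value, functionName) {
  return {
    MetricName: metricName,
    Dimensions: [{ Name: 'FunctionName', Value: functionName }],
    Unit: unit,
    Value: value
  };
}

// Build the full putMetricData parameters for a successful invocation.
function buildMetricParams(namespace, functionName, processingTimeMs, numMessages) {
  return {
    Namespace: namespace,
    MetricData: [
      buildMetric('ProcessingTime', 'Milliseconds', processingTimeMs, functionName),
      buildMetric('ProcessedMessages', 'Count', numMessages, functionName)
    ]
  };
}

// Usage inside the handler:
//   const params = buildMetricParams(process.env.CLOUDWATCH_NAMESPACE, 'processHtml', processingTime, numMessages);
//   await cloudwatch.putMetricData(params).promise();
```

Reading the namespace from CLOUDWATCH_NAMESPACE keeps it in step with the value set in serverless.yml instead of hard-coding it in each call.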

In this series on building scalable web scraping with serverless, we've set up the infrastructure, processed the data, and connected the stages into an automated workflow using AWS services.

We started by fetching web content and finished by storing structured data in PostgreSQL. This has not only demonstrated the power of serverless technologies but also underscored their efficiency in managing complex data operations.

Achievements and Reflections
  • Scalable Infrastructure: We've established a robust setup using AWS Lambda, SQS, S3, and RDS, which allows our system to handle varying loads efficiently. The serverless model has proven ideal for handling the elastic nature of web scraping tasks, where workloads can fluctuate dramatically.
  • Efficient Data Processing: By transforming HTML into JSON and storing it in a relational database, we've streamlined the flow of information. This structured approach enhances our ability to query and analyze the data, unlocking more value from the information we collect.
  • Workflow Orchestration: Utilizing SQS has enabled us to orchestrate the workflow seamlessly. This setup ensures that each component of our system works in harmony, maintaining a continuous and error-resilient operation.
  • Monitoring and Optimization: Implementing CloudWatch for monitoring and sending custom metrics has provided us with deep insights into our system’s performance. This data is crucial for ongoing optimization, allowing us to make informed decisions to enhance efficiency and reduce costs.
Future Directions

With the foundation in place, there are numerous pathways to expand and refine our system:

  • Advanced Data Analysis: Integrating AI and machine learning models to derive more complex insights from the scraped data could be a future step. For example, sentiment analysis or trend prediction models could provide additional layers of analysis for market research or competitive intelligence.
  • Enhanced Error Handling and Automation: Further automating the response to system alerts and errors through more sophisticated AWS Lambda functions could increase system robustness. Implementing intelligent retry mechanisms and more nuanced error processing can improve data integrity and reliability.
  • Cost Optimization: Continued monitoring of the system's cost efficiency will be essential as usage scales. Leveraging AWS cost management tools to fine-tune resource allocation and function execution can yield significant savings.
  • Security Enhancements: As the system scales, so does the need for robust security measures. Implementing stricter IAM roles, enhancing data encryption, and conducting regular security audits will ensure that the system remains secure against potential vulnerabilities.
Final Thoughts

The scalability, cost-effectiveness, and operational agility offered by serverless computing have made it an invaluable paradigm for modern software architectures, particularly for data-heavy applications like web scraping. As technology evolves, so will the opportunities to enhance these systems, making them smarter, faster, and more cost-effective.

Further Resources
  1. AWS Documentation
  2. Serverless Framework Documentation
  3. Educational Platforms:
    • Coursera and Udacity offer courses on cloud computing and serverless architectures that include hands-on projects and real-world applications.
    • A Cloud Guru: A dedicated platform for cloud computing learning, offering in-depth courses specifically on AWS services.
Community and Forums
  1. Serverless Stack (SST) Forum:
    • A vibrant community where you can ask questions, share knowledge, and learn from the experiences of other developers working with serverless technologies.
  2. AWS re:Post & Stack Overflow:
    • Engage with the community on AWS re:Post for Q&A and discussions specific to AWS.
    • Stack Overflow has a rich set of discussions and solutions for serverless architecture challenges, available under tags like aws-lambda and amazon-s3.
Advanced Tools and Services
  1. Local Development and Testing:
    • LocalStack: A fully functional local AWS cloud stack that helps in developing cloud applications by testing them offline.
    • Serverless Offline: A Serverless Framework plugin that emulates AWS Lambda and API Gateway on your local machine to speed up your dev cycles.
  2. Performance Monitoring and Optimization:
    • Datadog: Provides cloud-scale monitoring that helps you track your serverless functions' performance and optimize resource usage.
    • New Relic: Offers deep performance analytics for every part of your software environment, including serverless functions.
Books
  1. "Serverless Architectures on AWS" by Peter Sbarski: This book teaches you how to build, secure, and manage serverless architectures that can power the most demanding web and mobile apps.
  2. "AWS Lambda in Action" by Danilo Poccia: A comprehensive book on building applications with AWS Lambda, particularly useful for developers looking to expand beyond basic examples.

Article last update: May 13, 2024


Frequently Asked Questions

The main goals are to transform HTML data stored in AWS S3 into JSON format, load this structured data into a PostgreSQL database, and optimize and monitor the entire process using cloud tools like AWS CloudWatch.

The cheerio library is recommended for parsing HTML content and transforming it into JSON format.

The Lambda function retrieves HTML files from S3 using the key provided in the SQS message, parses the HTML content with cheerio, extracts necessary information, and constructs JSON objects.

Amazon RDS hosts the PostgreSQL database where the transformed JSON data is stored, providing a scalable, secure, and managed database service.

Automation and orchestration are achieved by integrating SQS, which triggers each step of the process, ensuring a smooth data flow from HTML scraping to data storage.

Key considerations include configuring instance settings, enabling public accessibility if needed, ensuring security through VPC configuration, and modifying VPC security groups to allow inbound traffic on port 5432.

Cost-efficiency and high performance are ensured by monitoring serverless functions with AWS CloudWatch, optimizing SQS settings, and continuously adapting based on operational metrics.

SQS helps in decoupling components, handling spikes in workload, ensuring message delivery, and managing the flow of data processing tasks through queues and visibility timeout settings.

Errors can be managed by implementing dead letter queues (DLQ) for SQS to handle messages that cannot be processed, and by monitoring and adjusting settings based on CloudWatch insights.

Future directions include advanced data analysis using AI and machine learning, enhanced automation and error handling, continuous cost optimization, and strengthening security measures.
