Today we will focus on processing the HTML data stored in AWS S3, transforming it into JSON format, and then loading this data into a PostgreSQL database. This will complete the pipeline by adding data processing and storage capabilities, making your system capable of handling end-to-end data operations in a scalable and efficient manner.
In the first part of our series on building a scalable serverless web scraping system, we successfully set up the foundational components. We created AWS Lambda functions to fetch sitemap.xml, extract URLs, and store HTML content in Amazon S3, orchestrated by SQS for seamless message handling. Having established a robust infrastructure for scraping and initial data storage, it's time to move forward with the next stages of our data pipeline.
In Part 2 of this series, we will focus on transforming the HTML content stored in S3 into a more structured JSON format. This transformation is crucial as it facilitates easier data manipulation and integration with various analytics tools and databases. We will then detail how to load this structured data into a PostgreSQL database hosted on Amazon RDS, effectively completing our end-to-end serverless web scraping solution.
The process of transforming the HTML to JSON is essential as it allows us to extract relevant data from the raw HTML and prepare it for analytical tasks or storage in a database. In this section, we'll cover the setup of a Lambda function that fetches HTML from S3, parses it using cheerio, and transforms it into JSON.
Sample Code for the Lambda Function
Here is a basic example of what the Lambda function’s code might look like:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const cheerio = require('cheerio');

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { bucketName, objectKey } = JSON.parse(record.body);

    // Fetch the HTML file from S3
    const htmlData = await s3.getObject({
      Bucket: bucketName,
      Key: objectKey
    }).promise();

    // Load HTML into Cheerio
    const $ = cheerio.load(htmlData.Body.toString('utf-8'));

    // Extract relevant data
    const pageData = {
      title: $('title').text(),
      headings: $('h1, h2, h3').map((i, el) => $(el).text()).get(),
      links: $('a').map((i, el) => $(el).attr('href')).get()
    };

    // Construct JSON object
    const jsonData = JSON.stringify(pageData);

    // Additional code to handle JSON data as needed
    console.log(jsonData);
  }
};
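In the sample above, the final comment is a placeholder. One way to complete the hand-off to the next stage is to write the JSON document to the JSON bucket and notify the downstream queue. Here is a minimal sketch, assuming the JSON_BUCKET_NAME environment variable from the configuration below plus a hypothetical JSON_QUEUE_URL variable pointing at the JsonQueue:

const sqs = new AWS.SQS();

// Inside the loop, after jsonData has been built:
const jsonKey = objectKey.replace(/\.html?$/, '') + '.json';

// Store the JSON document in the dedicated JSON bucket
await s3.putObject({
  Bucket: process.env.JSON_BUCKET_NAME,
  Key: jsonKey,
  Body: jsonData,
  ContentType: 'application/json'
}).promise();

// Notify the next stage (the PostgreSQL loader) via SQS
await sqs.sendMessage({
  QueueUrl: process.env.JSON_QUEUE_URL, // hypothetical variable holding the JsonQueue URL
  MessageBody: JSON.stringify({ objectKey: jsonKey })
}).promise();

Writing to the JSON bucket is already covered by the s3:PutObject statement in the configuration below; sending to the queue would additionally require an sqs:SendMessage permission on JsonQueue.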
To incorporate the Lambda function for transforming HTML into JSON into your serverless.yml configuration, you'll need to define the function, set its trigger, and ensure it has the necessary permissions to access AWS S3 and SQS. Here's what the updated serverless.yml file looks like:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x # Specify your Node.js version
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
      Resource:
        - arn:aws:s3:::your-html-bucket-name/*
        - arn:aws:s3:::your-json-bucket-name/*
    - Effect: Allow
      Action:
        - sqs:ReceiveMessage
        - sqs:DeleteMessage
        - sqs:GetQueueAttributes
      Resource:
        - !GetAtt HtmlQueue.Arn

functions:
  transformHtmlToJson:
    handler: transformHtmlToJson.handler # Update with your actual handler file and function
    events:
      - sqs:
          arn: !GetAtt HtmlQueue.Arn
    environment:
      HTML_BUCKET_NAME: your-html-bucket-name
      JSON_BUCKET_NAME: your-json-bucket-name

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
    JsonQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: jsonQueue
Key Components of the Configuration
- iamRoleStatements: grants the function read/write access to the HTML and JSON buckets and permission to receive, delete, and inspect messages on HtmlQueue.
- transformHtmlToJson: the transformation Lambda, triggered by messages arriving on HtmlQueue, with the bucket names exposed as environment variables.
- resources: declares the two SQS queues: HtmlQueue, which feeds this stage, and JsonQueue, which hands results off to the loading stage.
After transforming HTML content into structured JSON, the next step in our serverless web scraping infrastructure involves storing this data in a PostgreSQL database. Using Amazon RDS (Relational Database Service) to host PostgreSQL provides a scalable, secure, and managed database service that integrates seamlessly with AWS Lambda. This section will guide us through setting up a PostgreSQL instance, creating a Lambda function to load JSON data into the database, and ensuring smooth data flow and integrity.
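If you prefer to keep the database definition in the same stack, the RDS instance can be declared under the resources section as well. The following is only a sketch with placeholder sizing and credentials resolved from the SSM parameters described later in this section; a production setup would also configure a VPC, subnet group, and security groups:

resources:
  Resources:
    PostgresInstance:
      Type: AWS::RDS::DBInstance
      Properties:
        Engine: postgres
        DBInstanceClass: db.t3.micro   # placeholder instance size
        AllocatedStorage: 20           # GiB, placeholder
        DBName: ${ssm:/myapp/prod/db/database}
        MasterUsername: ${ssm:/myapp/prod/db/username}
        MasterUserPassword: ${ssm:/myapp/prod/db/password}
        PubliclyAccessible: false

Alternatively, you can create the instance through the RDS console or a separate infrastructure stack and simply point the Lambda functions at its endpoint.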
For database insertions, we will define a new Lambda function in serverless.yml that is triggered by new messages arriving in the SQS queue after the JSON transformation step.
Sample Code for the Lambda Function
Here is a basic example of what the Lambda function’s code might look like:
const { Pool } = require('pg');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Configure PostgreSQL connection using environment variables
const pool = new Pool({
  user: process.env.PG_USER,
  host: process.env.PG_HOST,
  database: process.env.PG_DATABASE,
  password: process.env.PG_PASSWORD,
  port: process.env.PG_PORT,
});

exports.handler = async (event) => {
  for (const record of event.Records) {
    const objectKey = JSON.parse(record.body).objectKey;
    const bucketName = process.env.S3_BUCKET_NAME; // configured in serverless.yml

    // Fetch JSON from S3
    const data = await s3.getObject({
      Bucket: bucketName,
      Key: objectKey
    }).promise();

    const jsonData = JSON.parse(data.Body.toString('utf-8'));

    // Insert data into PostgreSQL
    const client = await pool.connect();
    try {
      await client.query('INSERT INTO your_table (data_field) VALUES ($1)', [jsonData]);
    } catch (err) {
      console.error('Error inserting data into PostgreSQL', err);
    } finally {
      client.release();
    }
  }
};
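The INSERT statement assumes that the target table already exists; your_table and data_field are placeholder names. A minimal one-off setup for a JSONB-backed table, run once against the RDS instance (for example from a migration script rather than the Lambda itself), might look like this:

const { Pool } = require('pg');

// One-off table setup using the same connection settings as the Lambda function
const pool = new Pool({
  user: process.env.PG_USER,
  host: process.env.PG_HOST,
  database: process.env.PG_DATABASE,
  password: process.env.PG_PASSWORD,
  port: process.env.PG_PORT,
});

(async () => {
  // Placeholder table: one JSONB column for the scraped page data
  await pool.query(`
    CREATE TABLE IF NOT EXISTS your_table (
      id SERIAL PRIMARY KEY,
      data_field JSONB NOT NULL,
      created_at TIMESTAMPTZ DEFAULT now()
    )
  `);
  await pool.end();
})();

Storing the page data in a JSONB column keeps the schema flexible while still allowing you to index and query individual fields later.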
Modifying the serverless.yml Configuration
First, define the environment variables in your serverless.yml file under the provider or function-specific environment section:
provider:
  name: aws
  environment:
    PG_USER: ${ssm:/myapp/prod/db/username}
    PG_HOST: ${ssm:/myapp/prod/db/host}
    PG_DATABASE: ${ssm:/myapp/prod/db/database}
    PG_PASSWORD: ${ssm:/myapp/prod/db/password}
    PG_PORT: 5432
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "secretsmanager:GetSecretValue"
        - "ssm:GetParameter"
      Resource: "*"
In this configuration, ${ssm:/myapp/prod/db/username} refers to a parameter stored in the AWS Systems Manager Parameter Store, which is a secure place to store configuration data and secrets.
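These parameters need to exist before you deploy. For illustration, they could be created with the AWS CLI (all values below are placeholders):

aws ssm put-parameter --name "/myapp/prod/db/username" --value "scraper" --type String
aws ssm put-parameter --name "/myapp/prod/db/host" --value "your-db-endpoint.rds.amazonaws.com" --type String
aws ssm put-parameter --name "/myapp/prod/db/database" --value "scraperdb" --type String
aws ssm put-parameter --name "/myapp/prod/db/password" --value "your-db-password" --type SecureString

Using SecureString for the password keeps it encrypted at rest in the Parameter Store.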
Then, define the load function:
functions:
  loadDataToPostgreSQL:
    handler: handlers.loadDataToPostgreSQL
    events:
      - sqs:
          arn: !GetAtt JsonQueue.Arn
    environment:
      S3_BUCKET_NAME: your-json-bucket-name
Key Components of the Configuration:
- handler: points to the loadDataToPostgreSQL function exported from handlers.js.
- events: the function is triggered by messages arriving on JsonQueue, i.e. after the HTML-to-JSON step has completed.
- environment: S3_BUCKET_NAME tells the function which bucket holds the JSON documents, while the PG_* variables defined at the provider level supply the database connection details.
In our serverless web scraping system, Amazon Simple Queue Service (SQS) plays a crucial role in orchestrating the workflow between different components. By adopting SQS, we can efficiently manage the communication and data flow between our Lambda functions, ensuring a seamless and scalable operation. In this section we will discuss how we utilize SQS to coordinate the data processing steps from scraping HTML to transforming it into JSON, and finally storing it in PostgreSQL.
SQS offers a robust, fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Using SQS, you can send, store, and receive messages between software components without losing messages or requiring other services to be available. Here's how SQS benefits our application:
- Decoupling: each Lambda function only needs to know about its input queue, so the scraping, transformation, and loading stages can evolve and scale independently.
- Buffering and retries: messages wait in the queue until a consumer is ready, and messages that fail processing become visible again after the visibility timeout, giving transient errors a chance to succeed on a later attempt.
- Scalability: Lambda polls the queues automatically and scales the number of concurrent executions with the backlog.
- Failure isolation: messages that repeatedly fail can be routed to a dead letter queue for inspection instead of blocking the rest of the pipeline.
The following example shows how to set up a Lambda function triggered by an SQS queue with a specific batch size, visibility timeout, and a dead letter queue for handling message processing failures:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x

functions:
  processHtml:
    handler: handlers.processHtml
    events:
      - sqs:
          arn: !GetAtt HtmlQueue.Arn
          batchSize: 10

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
        VisibilityTimeout: 300
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt HtmlDeadLetterQueue.Arn
          maxReceiveCount: 3

    HtmlDeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlDeadLetterQueue

    HtmlQueueDLQPolicy:
      Type: AWS::SQS::QueuePolicy
      Properties:
        Queues:
          - !Ref HtmlDeadLetterQueue
        PolicyDocument:
          Statement:
            - Effect: Allow
              Principal: "*"
              Action:
                - sqs:SendMessage
              Resource: !GetAtt HtmlDeadLetterQueue.Arn
              Condition:
                ArnEquals:
                  "aws:SourceArn": !GetAtt HtmlQueue.Arn
The following example demonstrates how you can modify a Lambda function's code to send custom metrics to AWS CloudWatch using the AWS SDK for Node.js. This can be very useful for tracking custom operational metrics like function execution times, the number of processed items, errors, or other specific metrics that help in monitoring and optimizing your serverless application.
First, add the appropriate IAM Role statement to serverless.yml:
provider:
  name: aws
  runtime: nodejs14.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - cloudwatch:PutMetricData
      Resource: "*"
  environment:
    CLOUDWATCH_NAMESPACE: 'WebScrapingService'
Use putMetricData to send metrics to CloudWatch:
const AWS = require('aws-sdk');
// Instantiate CloudWatch client
const cloudwatch = new AWS.CloudWatch({ apiVersion: '2010-08-01' });

// Use the namespace configured in serverless.yml, with a sensible fallback
const NAMESPACE = process.env.CLOUDWATCH_NAMESPACE || 'WebScrapingService';

exports.handler = async (event) => {
  const startTime = new Date(); // Record start time for execution timing

  try {
    // Example process: Parse messages from SQS event
    const numMessages = event.Records.length;

    // Simulate some processing logic
    event.Records.forEach(record => {
      console.log('Processing record:', record.messageId);
      // Further processing here...
    });

    // Calculate processing time
    const endTime = new Date();
    const processingTime = endTime - startTime;

    // Send custom metrics for processing time and throughput
    await cloudwatch.putMetricData({
      Namespace: NAMESPACE,
      MetricData: [
        {
          MetricName: 'ProcessingTime',
          Dimensions: [
            {
              Name: 'FunctionName',
              Value: 'processHtml'
            }
          ],
          Unit: 'Milliseconds',
          Value: processingTime
        },
        {
          MetricName: 'ProcessedMessages',
          Dimensions: [
            {
              Name: 'FunctionName',
              Value: 'processHtml'
            }
          ],
          Unit: 'Count',
          Value: numMessages
        }
      ]
    }).promise();

    console.log('Metrics pushed to CloudWatch');
  } catch (error) {
    console.error('Error processing messages:', error);

    // Send error metric to CloudWatch
    await cloudwatch.putMetricData({
      Namespace: NAMESPACE,
      MetricData: [
        {
          MetricName: 'Errors',
          Dimensions: [
            {
              Name: 'FunctionName',
              Value: 'processHtml'
            }
          ],
          Unit: 'Count',
          Value: 1
        }
      ]
    }).promise();

    throw error; // Re-throw error to be handled by Lambda or trigger DLQ
  }
};
Key Elements in the Code:
- cloudwatch.putMetricData: publishes custom metrics under the namespace configured via the CLOUDWATCH_NAMESPACE environment variable.
- Dimensions: every metric is tagged with the function name, so the same metric names can be reused across functions and still be filtered per function in CloudWatch.
- ProcessingTime and ProcessedMessages: record execution duration in milliseconds and the number of SQS messages handled per invocation.
- Errors: emitted in the catch block before the error is re-thrown, so failures are counted while still being surfaced to Lambda's retry and dead letter queue handling.
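Once these metrics are flowing, they can be used like any other CloudWatch metric, for example in dashboards or alarms. As an illustrative sketch (the SNS topic ARN is a placeholder), an alarm on the custom Errors metric could be added to the resources section:

resources:
  Resources:
    ProcessHtmlErrorsAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmDescription: Alert when processHtml reports errors
        Namespace: WebScrapingService
        MetricName: Errors
        Dimensions:
          - Name: FunctionName
            Value: processHtml
        Statistic: Sum
        Period: 300
        EvaluationPeriods: 1
        Threshold: 1
        ComparisonOperator: GreaterThanOrEqualToThreshold
        TreatMissingData: notBreaching
        AlarmActions:
          - arn:aws:sns:us-east-1:123456789012:your-alerts-topic # placeholder SNS topic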
In this series on building a scalable serverless web scraping system, we've successfully set up the infrastructure, processed data, and integrated sophisticated workflows using AWS services.
We started by fetching web content and finished by storing structured data in PostgreSQL. This has not only demonstrated the power of serverless technologies but also underscored their efficiency in managing complex data operations.
With the foundation in place, there are numerous pathways to expand and refine our system, from richer data extraction rules to deeper monitoring, alerting, and analytics on the stored data.
The scalability, cost-effectiveness, and operational agility offered by serverless computing have made it an invaluable paradigm for modern software architectures, particularly for data-heavy applications like web scraping. As technology evolves, so will the opportunities to enhance these systems, making them smarter, faster, and more cost-effective.
Article last updated: May 13, 2024