Today I'll show you how to build a scalable infrastructure for web scraping using the Serverless Framework.
Serverless computing offers a compelling solution by eliminating the need to manage infrastructure, allowing developers to focus solely on code. This model is perfect for web scraping tasks that vary in intensity and frequency, as it scales automatically to meet demand without any manual intervention. Moreover, serverless functions are cost-effective, as you only pay for the compute time you consume, making it an ideal choice for both small-scale projects and large-scale data extraction tasks.
In this blog series, we will explore how to leverage serverless technologies to build a robust and scalable web scraping infrastructure. We'll use a suite of AWS services including Lambda, S3, SQS, and RDS, combined with popular Node.js libraries like node-fetch for fetching data, cheerio for parsing HTML, and node-postgres to interact with databases.
Our journey will be split into two parts:
- Part 1 (this article): setting up the Serverless Framework, scheduling a Lambda function that reads a sitemap and queues URLs in SQS, and storing the fetched HTML in S3.
- Part 2: processing the stored HTML with cheerio, transforming it into JSON, and loading the structured data into a PostgreSQL database with node-postgres.
Serverless computing is a cloud computing execution model where the cloud provider manages the setup, capacity planning, and server management for you. Essentially, it allows developers to write and deploy code without worrying about the underlying infrastructure. In a serverless setup, you only pay for the compute time you consume—there is no charge when your code is not running.
Serverless is particularly adept at handling event-driven applications—like those triggered by a scheduled time to scrape data or a new item appearing in a queue. This fits naturally with the often sporadic and on-demand nature of web scraping, where workloads can be highly variable and unpredictable.
While AWS Lambda is one of the most popular serverless computing services, other cloud providers offer similar products, such as Azure Functions and Google Cloud Functions.
Each platform has its own strengths and pricing models, but AWS Lambda will be our focus due to its maturity, extensive documentation, and seamless integration with other AWS services like S3 and SQS.
After gaining a foundational understanding of serverless architecture, our next step involves setting up the Serverless Framework. This powerful framework simplifies deploying serverless applications and managing their lifecycle. In this section, we'll cover how to configure the Serverless Framework to work seamlessly with AWS services, a crucial component for our scalable web scraping infrastructure.
The Serverless Framework is an open-source CLI that provides developers with a streamlined workflow for building and deploying serverless applications. It abstracts much of the complexity associated with configuring various cloud services, making it easier to launch applications across different cloud providers like AWS, Azure, and Google Cloud.
To begin, you need to have Node.js installed on your machine as the Serverless Framework runs on it. Once Node.js is set up, you can install the Serverless Framework globally using npm:
npm install -g serverless
Or, if you prefer using Yarn:
yarn global add serverless
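You can confirm the installation and check which version you're running with:

serverless --version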
To deploy functions to AWS Lambda and manage resources, you need to configure your AWS credentials in the Serverless Framework:
serverless config credentials --provider aws --key YOUR_ACCESS_KEY_ID --secret YOUR_SECRET_ACCESS_KEY
This command stores your credentials in a local file, which the Serverless Framework uses to interact with your AWS account.
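Under the hood this typically writes a profile to the standard AWS credentials file at ~/.aws/credentials, along the lines of the sketch below (placeholder values). If you already manage named profiles there, you can skip the command and pass --aws-profile to the framework's commands instead.

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY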
Every Serverless Framework project contains a serverless.yml file at its root. This file is crucial as it defines the service configuration, including functions, events, and resources. Here’s a basic setup for our web scraping project:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *) # Runs every 4 hours
In this configuration:
- service names the project (web-scraping-service) within your AWS account.
- The provider block tells the framework to deploy to AWS in us-east-1 using the Node.js runtime.
- The functions block declares a single scrapeSite function whose handler is the scrape export in handler.js, triggered by a schedule event (the cron expression fires every 4 hours).
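If you find cron syntax awkward, the schedule event also accepts AWS rate expressions. The shorthand below is a sketch, not part of the original config, and is equivalent to the cron rule above:

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule: rate(4 hours)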
To fully harness the power of serverless architecture for our web scraping infrastructure, integrating Amazon Simple Queue Service (SQS) is essential. SQS will manage the messages related to tasks such as notifying our system when new data is available for processing. Here's how to extend our existing serverless.yml configuration to include SQS resources:
SQS queues can be defined and managed directly within the serverless.yml file, which allows for seamless integration and management within our AWS ecosystem. We will ultimately use two queues, one for HTML pages and another for the JSON data produced in Part 2, but for now let's add the HTML queue to our setup:
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - sqs:SendMessage
        - sqs:ReceiveMessage
        - sqs:DeleteMessage
        - sqs:GetQueueAttributes
      Resource: "*"

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue

functions:
  scrapeSite:
    handler: handler.scrape
    events:
      - schedule:
          rate: cron(0 */4 * * ? *) # Runs every 4 hours
    environment:
      HTML_QUEUE_URL: !Ref HtmlQueue
Explanation of the Configuration:
- iamRoleStatements grants the service's Lambda role permission to send, receive, and delete messages and to read queue attributes. Resource: "*" keeps the example simple; in production you would scope it to specific queue ARNs.
- The resources section declares the HtmlQueue SQS queue via CloudFormation.
- The environment block exposes the queue to the function as HTML_QUEUE_URL; !Ref on an SQS queue resolves to the queue URL.
With the Serverless Framework configured and ready, the next step in building our scalable web scraping infrastructure is to develop the first AWS Lambda function. This function will be responsible for fetching web data periodically based on the sitemap of a target website. Let's dive into creating and deploying this essential component.
AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the compute resources for you. Lambda is ideal for handling tasks that respond to HTTP requests, process queue messages, or, as in our case, execute tasks on a schedule.
Start by scaffolding a new project with the Serverless Framework:

serverless create --template aws-nodejs --path my-web-scraper
cd my-web-scraper
This command sets up a basic Node.js project with a serverless.yml file and a sample handler function in handler.js.
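The generated layout typically looks like this (the exact files can vary slightly between framework versions):

my-web-scraper/
├── .gitignore
├── handler.js      # sample handler we'll replace with our scraper
└── serverless.yml  # service configuration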
Next, install the libraries needed to fetch and parse the sitemap. Note that node-fetch is pinned to v2 here because v3 is ESM-only and cannot be loaded with require():

npm install node-fetch@2 xml2js
Create the scrape function in handler.js (or whatever your main handler file is named). It fetches the sitemap, parses it to extract the page URLs, and pushes each URL onto the SQS queue:
const fetch = require('node-fetch');
const xml2js = require('xml2js');
const AWS = require('aws-sdk'); // bundled with the Node.js Lambda runtime

const sqs = new AWS.SQS();

exports.scrape = async () => {
  try {
    const sitemapUrl = 'https://example.com/sitemap.xml';
    const response = await fetch(sitemapUrl);
    if (!response.ok) {
      throw new Error(`Failed to fetch sitemap: ${response.statusText}`);
    }
    const sitemapXml = await response.text();

    // Parse the sitemap XML and pull out every page URL
    const parsedSitemap = await xml2js.parseStringPromise(sitemapXml);
    const urls = parsedSitemap.urlset.url.map((u) => u.loc[0]);

    // Queue each URL for the downstream HTML-fetching function
    for (const url of urls) {
      const params = {
        MessageBody: JSON.stringify({ url }),
        QueueUrl: process.env.HTML_QUEUE_URL,
      };

      await sqs.sendMessage(params).promise();
    }
  } catch (error) {
    console.error('Error during scraping process:', error);
    // Optionally rethrow or handle error specifically (e.g., retry logic, notification)
  }
};
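For context, the parsing logic above assumes a standard sitemap such as the minimal example below. With xml2js's default settings every child element becomes an array, which is why the code reads u.loc[0]:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
  </url>
  <url>
    <loc>https://example.com/page-2</loc>
  </url>
</urlset>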
Error handling is a must for a reliable Lambda function, especially in a web scraping context where network issues, slow responses, or changes to the target site can all cause failures. Here's how you can build more comprehensive error handling into your serverless setup:
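One building block is retrying transient failures inside the handler itself, as hinted at in the catch block above. Below is a minimal sketch of a hypothetical fetchWithRetry helper; the name, attempt count, and back-off values are illustrative assumptions, not part of the original project:

// retry.js: retry a fetch a few times before giving up,
// waiting a little longer after each failed attempt.
const fetch = require('node-fetch');

async function fetchWithRetry(url, options = {}, retries = 3, backoffMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return response;
    } catch (error) {
      if (attempt === retries) throw error; // out of attempts, let the caller decide
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
}

module.exports = { fetchWithRetry };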
Integrate error monitoring and logging tools to capture errors for further analysis. Tools like AWS CloudWatch can be configured to monitor logs, which helps in diagnosing issues after deployment.
provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    # IAM permissions here
  # Lambda ships its logs to CloudWatch automatically; setting a retention
  # period keeps log groups from growing indefinitely
  logRetentionInDays: 14
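After deployment you can also stream a function's CloudWatch logs straight from the Serverless CLI, which is handy when diagnosing scraping failures:

serverless logs --function scrapeSite --tail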
Configure Dead Letter Queues (DLQ) in SQS for messages that cannot be processed after several attempts. This approach helps in isolating problematic messages and prevents them from clogging your processing pipeline.
resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt HtmlDeadLetterQueue.Arn
          maxReceiveCount: 3 # Adjust based on your needs

    HtmlDeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlDeadLetterQueue
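If messages do start landing in the dead letter queue, you can inspect them with the AWS CLI; the first command prints the queue URL to use in the second (shown here with a placeholder):

aws sqs get-queue-url --queue-name htmlDeadLetterQueue
aws sqs receive-message --queue-url <your-dead-letter-queue-url> --max-number-of-messages 5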
Properly manage timeouts and memory settings in your Lambda function to prevent unexpected terminations or performance issues.
functions:
  scrapeSite:
    handler: handler.scrape
    timeout: 30 # seconds
    memorySize: 256 # MB
    events:
      - schedule:
          rate: cron(0 */4 * * ? *) # Runs every 4 hours
Before deploying your function, test it locally or in a development environment to ensure it behaves as expected. The Serverless Framework provides commands for invoking functions locally and deploying them to AWS.
serverless invoke local --function scrapeSite

Once the function behaves as expected, deploy the service to AWS:

serverless deploy
After successfully setting up and deploying our Lambda function to scrape web data, the next crucial step in our serverless web scraping pipeline involves efficiently storing the retrieved data. Amazon S3 (Simple Storage Service) offers a robust solution for this purpose, providing scalability, data availability, security, and performance. This section will guide you through setting up S3 buckets and configuring your Lambda function to save scraped data directly to S3.
Amazon S3 is an object storage service built to store and protect any amount of data for a wide range of use cases, from websites and mobile applications to backup, archiving, and big data analytics. For our scraping pipeline, its key benefits include:
- Virtually unlimited capacity, with no disks or file systems to provision for the scraped HTML.
- High durability and availability for the raw data we collect.
- Fine-grained access control through IAM policies and bucket policies.
- Native integration with Lambda, SQS, and the rest of the AWS ecosystem.
Before you can store data, you need to create an S3 bucket in your AWS account. Each bucket's name must be unique across all existing bucket names in Amazon S3 (bucket names are shared among all users globally).
service: web-scraping-service

provider:
  name: aws
  runtime: nodejs14.x
  region: us-east-1
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:ListBucket"
      Resource: "arn:aws:s3:::web-scraping-data-yourname"
    - Effect: "Allow"
      Action:
        - "s3:PutObject"
        - "s3:GetObject"
      Resource: "arn:aws:s3:::web-scraping-data-yourname/*"

resources:
  Resources:
    WebScrapingDataBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: web-scraping-data-yourname

    BucketPolicy:
      Type: AWS::S3::BucketPolicy
      Properties:
        Bucket:
          Ref: WebScrapingDataBucket
        PolicyDocument:
          Statement:
            - Effect: "Allow"
              Principal:
                AWS:
                  # IamRoleLambdaExecution is the logical ID of the execution role
                  # the Serverless Framework creates for the service's functions
                  Fn::GetAtt: [IamRoleLambdaExecution, Arn]
              Action:
                - "s3:PutObject"
                - "s3:GetObject"
              Resource:
                Fn::Join:
                  - ""
                  - - "arn:aws:s3:::"
                    - Ref: WebScrapingDataBucket
                    - "/*"
Explanation:
- The iamRoleStatements give the Lambda execution role permission to list the bucket and to read and write the objects inside it.
- The resources section creates the web-scraping-data-yourname bucket; replace the name with something globally unique.
- The BucketPolicy additionally grants that same execution role PutObject and GetObject on every key in the bucket, with the object ARN assembled via Fn::Join.
Next, we need a second Lambda function, fetchHtmlContent, that is triggered by messages on the HTML queue, downloads each page, and writes the HTML to the bucket. Here is how you could configure it in serverless.yml:
functions:
  fetchHtmlContent:
    handler: fetchHtml.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - HtmlQueue
              - Arn
    environment:
      S3_BUCKET: web-scraping-data-yourname

resources:
  Resources:
    HtmlQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: htmlQueue
The function itself lives in fetchHtml.js and uses node-fetch to fetch the web content. This example also rotates requests through a small proxy list via https-proxy-agent (an extra dependency to install), a common tactic to reduce the chance of being blocked while scraping:
const AWS = require('aws-sdk');
const fetch = require('node-fetch');
const { HttpsProxyAgent } = require('https-proxy-agent'); // npm install https-proxy-agent

const s3 = new AWS.S3();
const proxyList = ['http://proxy1.com:port', 'http://proxy2.com:port']; // Example proxy list

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { url } = JSON.parse(record.body);

    // Pick a random proxy for each request
    const proxy = proxyList[Math.floor(Math.random() * proxyList.length)];
    const proxyOptions = {
      headers: {
        'Proxy-Authorization': 'Basic ' + Buffer.from('user:password').toString('base64'), // if authentication is needed
      },
    };

    try {
      const response = await fetch(url, { agent: new HttpsProxyAgent(proxy), ...proxyOptions });
      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }
      const body = await response.text();

      // Store the raw HTML in S3, keyed by a timestamp
      const params = {
        Bucket: process.env.S3_BUCKET,
        Key: `${new Date().getTime()}.html`,
        Body: body,
        ContentType: 'text/html',
      };

      await s3.upload(params).promise();
      console.log('HTML uploaded successfully:', params.Key);
    } catch (error) {
      console.error('Error fetching and uploading HTML:', error);
    }
  }
};
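To exercise this handler locally before wiring it to the live queue, you can feed it a hand-written SQS-style event. The file name and URL below are placeholders; the handler only reads each record's body:

event.json:

{
  "Records": [
    { "body": "{\"url\": \"https://example.com/page-1\"}" }
  ]
}

serverless invoke local --function fetchHtmlContent --path event.json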
We've taken significant steps in establishing a robust and scalable serverless web scraping system using AWS technologies and the Serverless Framework. By carefully configuring and deploying our serverless functions, we are well on our way to creating a fully automated data extraction pipeline that leverages the cloud's power for efficiency and scalability.
In Part 2 of this series, we will delve deeper into transforming the HTML data into a more structured format (JSON), which can be used for various analytical purposes. We will implement additional Lambda functions to process the HTML data stored in S3, transform it into JSON, and finally store this structured data in a PostgreSQL database. This will complete our end-to-end data processing pipeline, showcasing how serverless technologies can be effectively utilized for complex data processing tasks in web scraping.
Article last update: May 13, 2024