Caching SPAs for SEO with Lambda@Edge
Tournamentmgr.com's SEO performance has been poor since its conception: when analyzing the number of pages Google crawled, the site averaged below forty per day.
Despite the fact that Tournament Manager hosts nearly ten thousand tournaments, and consequently nearly that many unique pages, search engines draw no distinction between its static and dynamic pages. This is because the site runs as a single-page JavaScript application whose dynamic content relies solely on external API calls, so a crawler that doesn't execute JavaScript sees only the empty application shell. After coming across a blog post by Jake Wells on the subject, I was compelled to research the topic. The question, how can I build something that is dynamic to search engine bots, unintrusive to the user experience, and budget-friendly for a serverless solution, became the center of my research.
In answering that question, I found many server-based tools as well as some third-party hosted utilities, but each carried either an operational maintenance cost or an expense paid to the third party, neither of which was an option. Services like prerender.io let you cache up to 250 pages for free, but at Tournament Manager's scale that's not enough. So, after going back to the drawing board, I wrote a module (more on this later) and containerized services to be deployed with AWS Fargate. Invoked both event-based and cron-style, the services generate sitemaps, render pages, and upload the results to an S3 bucket.
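To give a concrete picture of the cron-style invocation, below is a sketch of a CloudWatch Events rule that launches a Fargate task on a schedule. It assumes the Cluster and PreRender resources defined later in the Implementation section; the role and subnet values are placeholders standing in for the real deployment's resources.

# Sketch: run the prerender Fargate task nightly on a schedule.
# EventsInvokeRole and the subnet are hypothetical placeholders.
NightlyPrerender:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: cron(0 4 * * ? *)   # every day at 04:00 UTC
    Targets:
      - Id: prerender-task
        Arn:
          Fn::GetAtt: [Cluster, Arn]
        RoleArn:
          Fn::GetAtt: [EventsInvokeRole, Arn]  # role allowing ecs:RunTask
        EcsParameters:
          TaskDefinitionArn:
            Ref: PreRender
          TaskCount: 1
          LaunchType: FARGATE
          NetworkConfiguration:
            AwsVpcConfiguration:
              Subnets:
                - subnet-12345678              # placeholder subnet
              AssignPublicIp: ENABLED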
Design
Pyppeteer is a Python port of Puppeteer that drives headless Chrome, allowing a page, including its JavaScript-driven content, to be fully rendered regardless of site speed. Building on it, I created two modules, prerender and scraper, which read in the robots.txt file, discover its sitemaps, and loop through them to determine the URLs. Once complete, scraper renders these pages and returns the static HTML, which can then be placed into an S3 bucket. Below is a sample invocation.
# Example usage of https://github.com/danquack/Sitemap-Prerendering-S3
from os import environ

from prerender.prerender import Prerender

client = Prerender(robots_url="https://tournamentmgr.com/robots.txt",
                   s3_bucket=environ['s3_bucket'],
                   auth=(environ.get('username', None), environ.get('password', None)),
                   query_char_deliminator=' ',
                   allowed_domains=['tournamentmgr.com', 'amazonaws.com'])

# Capture the entire site
client.capture()

# Capture an individual page
client.capture_page_and_upload("https://tournamentmgr.com/home")
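For context, the core of the rendering step is a short Pyppeteer routine: launch headless Chrome, navigate, wait for network activity to settle so API-driven content has loaded, and capture the resulting DOM. A minimal sketch of that idea (illustrative only, not the module's exact internals):

# Minimal sketch of the rendering idea behind the scraper module
import asyncio
from pyppeteer import launch

async def render(url: str) -> str:
    """Render a URL in headless Chrome and return the final HTML."""
    browser = await launch(args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        # networkidle0 waits until the page's API calls have finished
        await page.goto(url, waitUntil='networkidle0')
        return await page.content()
    finally:
        await browser.close()

html = asyncio.get_event_loop().run_until_complete(
    render('https://tournamentmgr.com/home'))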
With the cached pages in place, I followed Jake Wells's blog, mentioned above, and created a simple origin-request Lambda@Edge function. Borrowing a prerender conditional, credited below, I route traffic to the cached version when a request originates from a web crawler, and to the live site when it comes from a typical user. To note: for this to work I had to modify CloudFront, whitelisting the User-Agent header and forwarding query strings.
// Credit: https://github.com/jinty/prerender-cloudfront
// Credit: https://aws.amazon.com/blogs/networking-and-content-delivery/dynamically-route-viewer-requests-to-any-origin-using-lambdaedge/

// Lambda@Edge does not support environment variables, so the prerender
// bucket's domain (the bucket from the deployment below) is defined here
const s3DomainName = 'tm-prerender.s3.amazonaws.com';

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const user_agent = headers['user-agent'];
    if (user_agent) {
        // Treat known crawlers and link-preview bots as prerender traffic
        let prerender = /googlebot|bingbot|yandex|baiduspider|Facebot|facebookexternalhit|twitterbot|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator/i.test(user_agent[0].value);
        prerender = prerender || /_escaped_fragment_/.test(request.querystring);
        // Never prerender static assets
        prerender = prerender && !/\.(css|xml|less|png|jpg|jpeg|gif|pdf|doc|txt|ico|rss|zip|mp3|rar|exe|wmv|avi|ppt|mpg|mpeg|tif|wav|mov|psd|ai|xls|mp4|m4a|swf|dat|dmg|iso|flv|m4v|torrent|ttf|woff|svg|eot)$/i.test(request.uri);
        if (prerender) {
            // If a query string is provided, remove it and replace the ?
            // with a + (translated to a space by S3)
            if (request.querystring) {
                request.uri += `+${request.querystring}`;
                request.querystring = '';
            }
            // Point the request at the prerender bucket on S3
            request.origin = {
                s3: {
                    domainName: s3DomainName,
                    region: '',
                    authMethod: 'none',
                    path: '',
                    customHeaders: {}
                }
            };
            request.headers['host'] = [{ key: 'host', value: s3DomainName }];
            console.log(user_agent[0].value, "requesting", request.uri);
        }
    }
    callback(null, request);
};
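For reference, the CloudFront changes mentioned above, whitelisting the User-Agent header, forwarding query strings, and attaching the function on the origin request, might look roughly like this in CloudFormation. This is a sketch only: the origin, distribution settings, and the published Lambda version ARN are placeholders.

# Sketch of the relevant CloudFront settings; origin and ARN are placeholders
Distribution:
  Type: AWS::CloudFront::Distribution
  Properties:
    DistributionConfig:
      Enabled: true
      Origins:
        - Id: live-site
          DomainName: tournamentmgr.com        # placeholder live origin
          CustomOriginConfig:
            OriginProtocolPolicy: https-only
      DefaultCacheBehavior:
        TargetOriginId: live-site
        ViewerProtocolPolicy: redirect-to-https
        ForwardedValues:
          QueryString: true                    # forward query strings
          Headers:
            - User-Agent                       # whitelist the User-Agent header
        LambdaFunctionAssociations:
          - EventType: origin-request
            # placeholder: published version ARN of the function above
            LambdaFunctionARN: arn:aws:lambda:us-east-1:123456789012:function:seo-prerender-router:1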
Implementation
Utilizing Fargate, AWS's serverless compute engine for containers, I created two task definitions: an internal service that translates the DynamoDB database into valid sitemap XML URLs, and a prerender service built on the aforementioned Sitemap Prerender application. Finally, I wrote a Serverless Framework wrapper around CloudFormation to deploy the resources. As seen below, this does several things, including creating the IAM roles, the cluster for the services, and the task definitions.
---
service: seo

provider:
  name: aws
  stage: ${opt:stage, 'dev'}

resources:
  Resources:
    PrerenderRepository:
      Type: AWS::ECR::Repository
      Properties:
        RepositoryName: ${self:service}-prerender-${self:provider.stage}
    Cluster:
      Type: AWS::ECS::Cluster
      Properties:
        ClusterName: ${self:service}-${self:provider.stage}
    PreRender:
      Type: AWS::ECS::TaskDefinition
      Properties:
        Family: ${self:service}-prerender-${self:provider.stage}
        Memory: 2048
        Cpu: 512
        RequiresCompatibilities:
          - FARGATE
        TaskRoleArn:
          Ref: TaskExecutionRole
        ExecutionRoleArn:
          Ref: TaskExecutionRole
        NetworkMode: awsvpc
        ContainerDefinitions:
          - Name: ${self:service}-prerender-${self:provider.stage}
            Image:
              Fn::Join:
                - ""
                - - Ref: AWS::AccountId
                  - ".dkr.ecr."
                  - Ref: AWS::Region
                  - ".amazonaws.com/${self:service}-prerender-${self:provider.stage}:latest"
            LogConfiguration:
              LogDriver: awslogs
              Options:
                awslogs-group:
                  Ref: PrerenderLogs
                awslogs-region:
                  Ref: AWS::Region
                awslogs-stream-prefix:
                  Fn::Sub: "${self:service}-prerender-${self:provider.stage}"
            EntryPoint:
              - "python"
            Command:
              - "index.py"
            Environment:
              - Name: AWS_DEFAULT_REGION
                Value:
                  Ref: AWS::Region
              - Name: AWS_ACCESS_KEY_ID
                Value: ${env:AWS_ACCESS_KEY_ID}
              - Name: AWS_SECRET_ACCESS_KEY
                Value: ${env:AWS_SECRET_ACCESS_KEY}
              - Name: ENVIRONMENT
                Value: ${self:provider.stage}
              - Name: robots_url
                Value: "https://tournamentmgr.com/robots.txt"
              - Name: s3_bucket
                Value: "tm-prerender"
    PrerenderLogs:
      Type: AWS::Logs::LogGroup
      Properties:
        LogGroupName: /ecs/${self:service}-prerender-${self:provider.stage}
        RetentionInDays: 5
    TaskExecutionRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - ecs-tasks.amazonaws.com
              Action:
                - 'sts:AssumeRole'
        Policies:
          - PolicyName: TaskExecutionRole
            PolicyDocument:
              Statement:
                - Effect: Allow
                  Action:
                    - 'ecr:GetAuthorizationToken'
                    - 'ecr:BatchCheckLayerAvailability'
                    - 'ecr:GetDownloadUrlForLayer'
                    - 'ecr:BatchGetImage'
                    - 'logs:CreateLogStream'
                    - 'logs:PutLogEvents'
                  Resource: '*'
                - Effect: Allow
                  Action:
                    - 's3:PutObject'
                  Resource:
                    - "arn:aws:s3:::tm-prerender/*"
Results
The results speak for themselves. After implementing the sitemap, rendering the dynamic pages, and routing bot traffic to the new URLs, crawlable addresses increased by 900%, going from 40 to 400 per day. Another benefit is with Open Graph. Because the pages were previously rendered client-side, tournament metadata and images were lost when users shared their tournaments. With this change, users can now share their tournaments on any major platform and get their own custom icons and information, better connecting their audience to their tournaments.