Analyze CloudFront Logs in S3


SpectX gets you instant insights into your web resources served by CloudFront. Download the free desktop edition to your computer to parse and query the S3 bucket storing your log files. Follow the steps below and run the example queries to get a general overview or specific questions answered in minutes. The example queries are also included in the SpectX GitHub query pack.

AWS CloudFront is a content delivery network (CDN) that caches your digital content (data, videos, applications, and APIs) globally and speeds up its delivery anywhere in the world. CloudFront logs give you detailed records of each request, whether it comes from a user or a service. The information is similar to a web server access log, but instead of originating from a single server, the logs come from multiple edge locations, which sit closer to your users and cache your content. Analyzing the logs answers questions such as: who or what requested which resource, when, and at which edge location? Were the requests successful, or did they return errors? Were the users (likely) human or bots? How many bytes were transferred? How are the edge locations performing, and which edges are under heavier load?
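For orientation, each CloudFront standard log file is a gzipped, tab-separated text file in W3C extended format. The excerpt below is illustrative only (made-up values, truncated field list); the exact fields depend on your log format version:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) ...
2020-03-10  12:01:23  FRA53-C1  5987  192.0.2.10  GET  d111111abcdef8.cloudfront.net  /index.html  200  -  Mozilla/5.0  ...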

Note that Amazon delivers access logs on a best-effort basis and therefore recommends that you use the logs to understand the nature of the requests rather than as a complete accounting of all requests. The log entry for a particular request might be delivered long after the request was actually processed and, in rare cases, a log entry might not be delivered at all.

Connect SpectX to Your CloudFront Bucket

To query CloudFront logs with SpectX, follow these steps:

  1. Download and install Free SpectX.
  2. Connect SpectX to the S3 bucket containing your CloudFront logs:
    1. Press ‘New’ > ‘Data Store’.
    2. Pick where you want to save the data store link in the SpectX resource tree (User / Shared / System).
    3. Fill in the information about your bucket, click Create, and close the Data Store window.
Create a data store to connect to your CloudFront logs in S3

A note about caching: check the ‘is cacheable’ box if you wish to cache the data on the machine running SpectX instead of retrieving it from S3 with each query. The hot cache period controls how much of the most recent data is kept: for example, if your query goes back 2 weeks but you’ve set the hot cache period to 1 week, only the latest 1 week of data is cached. If you leave the hot cache period at its default (0), SpectX caches all the data it reads for a query.

3. To get started with the queries, copy-paste this base query from GitHub into the SpectX query window.

4. To point the query to your logs, replace the prefix value in the init block (line 2 of the base query). The prefix is the first part of the path, up to the timestamp. To get the prefix value, click the S3 data store you just created in the resource tree, navigate to one of the log files, and copy the path from the SpectX data browser window. For example, if the full path to one of your CloudFront log files is:
  s3://bucket/path/path2/E12QSKSMO12A37.2020-03-10-12.9398603f.gz
then the prefix value is:
  s3://bucket/path/path2/E12QSKSMO12A37
5. The base query looks back at the last 6 days of logs. To change the period, modify the timestamps in the init block (lines 3-4 in the base query); a sketch of what the init block might look like is shown below.
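To make steps 4 and 5 concrete, here is a rough sketch of an init block. The argument names and time syntax here are assumptions for illustration; copy the authoritative version from the GitHub base query and only swap in your own values:

// sketch only -- names and syntax are assumptions, the GitHub base query is authoritative
init(
    prefix: 's3://bucket/path/path2/E12QSKSMO12A37',  // line 2: replace with your own prefix
    from:   now()[-6 day],                            // line 3: start of the default 6-day window
    to:     now()                                     // line 4: end of the window
);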

6. Press Run. You should see results similar to this:


What to Look for in CloudFront Logs? The Queries

CloudFront logs, like any other logs, contain a lot of noise. The first thing to do is to clean the dataset a bit and focus on what’s interesting. For example, if you are interested in the behaviour of (human) users, a robust way to get rid of automated requests is to exclude irrelevant user agents. To see which user agents are present, paste the following query at the end of the base query:

| select(user_agent, cnt:count(*))
| group(@1)
| sort(cnt DESC)
Next, hold down the Ctrl (or Cmd on macOS) key and pick the cells that contain values you consider irrelevant. Then right-click on one of the highlighted rows and pick filter_out. A query line similar to this gets added to the script:

| filter_out(user_agent IN ('Zabbix', 'Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)', 'curl/7.61.1'))
Another, slightly more sophisticated way of filtering out bots is to look at not only the user agent but also the IP (or the AS name, if you’ve configured SpectX access to Maxmind’s databases). This can reveal interesting records that seem to be made by a ‘boring’ user agent but come from a rather interesting IP. Again, this is relative and requires some knowledge of the context. To use the following query, connect SpectX to the Maxmind GeoLite2 databases.
//| select(user_agent, ip, cnt:count(*)) // variant without Maxmind: group by raw IP instead
| select(user_agent, asname(ip), cnt:count(*))
| group(@1, @2)
| sort(cnt DESC)
The third option: look at unique IPs making hits during fixed time intervals. If we count unique IPs making hits in 1-hour intervals and a single IP shows up in more than 16 different intervals during a 24-hour period, it is likely not human. There are corner cases, e.g. proxy servers, but in addition to the IP you can also look at unique ip & user_agent or ip & cookie pairs, etc.; a variant for pairs is sketched after the query below.

| select(truncTime:timestamp[1 h], ip, count(*)) // truncate to the 1-hour intervals used in the heuristic above
| group(truncTime, ip)
| select(ip, timeUnits:count(*), hits:sum_long(count))
| group(ip)
| sort(timeUnits DESC)
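The same pattern extends to pairs. As a sketch, assuming the same field names as above, this counts the intervals in which each ip & user_agent pair was active:

| select(truncTime:timestamp[1 h], ip, user_agent, count(*))
| group(truncTime, ip, user_agent)
| select(ip, user_agent, timeUnits:count(*), hits:sum_long(count))
| group(ip, user_agent)
| sort(timeUnits DESC)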
Here are 15 more queries to copy-paste to the end of the base query for insights.

1) What is the time span of this dataset? Earliest timestamp, latest timestamp, span in seconds?
| select(earliest:min(timestamp), latest:max(timestamp))
| select(*, span_seconds:(latest-earliest)/1000)

2) How many requests returned the status code 200?
| filter(status = 200)
| select(count(*))

3) Count the number of hits in status ranges
| select(
  total:count(*),
  status_200:count(status=200),
  status_ranges:{
    status_1xx:count(status between 100 and 199),
    status_2xx:count(status between 200 and 299),
    status_3xx:count(status between 300 and 399),
    status_4xx:count(status between 400 and 499),
    status_5xx:count(status between 500 and 599)
  }
)

4) How many requests could not be satisfied by an object in the edge cache?
| select(
  total:count(*),
  Hit:count(result_type = 'Hit'),
  Miss:count(result_type = 'Miss'),  // 'Miss' answers the question: the object was not in the edge cache
  Redirect:count(result_type = 'Redirect'),
  Error:count(result_type = 'Error')
)

5) In 1-minute intervals, how many requests could not be satisfied by an object in the edge cache?
| select(
  time:timestamp[1 min],
  Hit:count(result_type = 'Hit'),
  Miss:count(result_type = 'Miss'),
  Redirect:count(result_type = 'Redirect'),
  Error:count(result_type = 'Error')
)
| group(time)

6) Which URIs are generating the most error messages?
| filter(result_type = 'Error')
| select(uri, cnt:count(*))
| group(@1)
| sort(cnt DESC)
| limit(100)

7) How many requests were served to the viewer from the edge cache?
| select(
  total:count(*),
  Hit:count(result_type = 'Hit'),  // 'Hit' answers the question: the object was served from the edge cache
  Miss:count(result_type = 'Miss'),
  Redirect:count(result_type = 'Redirect'),
  Error:count(result_type = 'Error')
)

8) How many total bytes were served and received?
| select(total_bytes_Cloudfront_served:sum_long(bytes), // total bytes CloudFront served to the viewer in response to the request
         total_bytes_from_viewer:sum_long(cs_bytes))    // bytes of data the viewer included in the request, including headers

9) Per minute, how many total bytes were served and received?
| select(time:timestamp[1 min], Cloudfront_served:sum(bytes), Viewer_received:sum(cs_bytes))
| group(time)
Press Chart to see the trend on a graph.

10) Which edge handled the most requests?
| select(location, count(*))
| group(location)
| sort(count desc)

11) Which edge handled the most requests, including country and city?
This query joins the AWS edge locations published by CloudPing.cloud (https://www.cloudping.cloud) with the results of your query.
//paste this after the PATTERN section of the script

// the first 3 characters of an edge location code form the IATA airport code used for the join
$location_code(location)=substr($location, 0, 3);
@edges = LIST('https://cdn.jsdelivr.net/gh/ft6/cloud.feitsui.com/cloudfront-edge-locations.json')
| parse(pattern:"JSON:j EOL")
| select(node:unnest(VARIANT_OBJECT(j[nodes])))
| select(location_code:node[key], country:STRING(node[value][country]), city:STRING(node[value][city]))
| group(location_code) // make sure we have no dups.
;

//paste this to the end of the script:
| select(*, $location_code(location))
| leftjoin(@edges on location_code)
| select(country, city, *)
| unselect(location_code, right_location_code)

12) Which edges in which cities transferred the most bytes?
//keep the script you pasted after the PATTERN section in 11) and paste this to the end of the query
| select(*, $location_code(location))
| leftjoin(@edges on location_code)
| select(city, transferred:sum_long(bytes))
| group(city)
| sort(transferred DESC)
// | limit(9,1) // skip 9 rows, keep 1, i.e. show only the 10th

13) Which edge saw the most unique IP addresses?
| select(location, ip, count(*))
| group(location, ip)
| select(location, uniq_ips:count(*), total_hits:sum_long(count))
| group(location)
| sort(uniq_ips desc)

14) Which edge has the lowest average latency?
time_taken is the number of seconds (to the thousandth of a second, for example, 0.002) between the time a CloudFront edge server receives a viewer's request and the time CloudFront writes the last byte of the response to the edge server's output queue, as measured on the server.
//| filter(uri like '/dist/img/%.jpg') // only for dist images
| select(location, count(*), min(time_taken), max(time_taken), avg(time_taken), STDDEV(time_taken))
| group(location)
| sort(avg_time_taken)

15) At the 10 edges with the most bytes transferred, what was the average number of bytes transferred?
| select(location, transferred:sum_long(bytes))
| group(location)
| sort(transferred DESC)
| limit(10)
| select(avg_bytes_in_top_10edges:long(avg(transferred)))
