Parsing BlueCoat Proxy Logs

By Jüri Shamov-Liiver on 17 May 2018

Finding sample logs on the web is not an easy task. This is quite understandable as log content is nowadays considered sensitive (hint: GDPR). Therefore I was excited to come across these samples. Many thanks, @anton_chuvakin and @automine for mentioning this on Twitter. Loads of samples of different logs there to play around with. A bit outdated in terms of their collection dates but very much up-to-date in terms of their nightmarish structure. Some of the packages are the source data of the HoneyNet Project Challenges. These make great lessons for performing security log analysis.

I headed for BlueCoat ProxySG logs as something we've not yet played around with SpectX. As always, the first thing to look at was the structure of the log for parsing out the fields I need. So in I go with Pattern Developer (press Prepare Pattern after browsing to the downloaded log in Data Browser).
By the way, if it weren't for the zip-compression, SpectX could also run queries on files hosted on the project page without importing them anywhere.

Back to the pattern. What a cool surprise: one of the apparent header rows (prefixed with a #) contains field names! Great, no need to figure them out myself. Records appear to be in one row with a field separated by spaces. But what's their format (timestamp, integer, quoted string,...)? The comment on log-sharing.dreamhosters.com site mentions this is a standard web proxy log in W3C format..., so perhaps a manual would give us even a quicker and more precise definition than figuring it out based on actual field data?

A quick Google and voilà: the BlueCoat Admin Manual for SGOS 5.1.4 listing 10 templates for W3C ELFF formats to choose from. None of them matches the set of fields in the sample though. The closest one seems to be a subset of the 10 first fields of the sample. Perhaps the "standard" is not that universal after all... Ok, moving on. A table in Appendix B has helpfully listed the meanings of the fields. Kewl, saving some time again. However, the field formats appear nowhere to be seen. Well, nobody or -thing is perfect. Luckily, finding correct data format types is pretty straightforward with SpectX, so it shouldn't take too much time.

Since my goal is to explore the log from all possible angles I'd like to parse out every one of the fields. SpectX Pattern Developer is invaluable here, lighting up the matched fields as I enter matcher tokens:


So, I go field by field from right to left. Seems fairly simple until the cs-uri-query field. Some rows contain '-'.  An unusual value for a uri query, perhaps signifying that cs-uri-query has no value? No clues in W3C ELFF format spec. Luckily the BlueCoat Admin Manual Appendix B confirms the no-value theory. Therefore, I can use alternatives list to match '-' value to NULL. The same also applies for other
fields, such as cs-username, RS(Content-Type), CS(User-Agent), x-virus-id, x-virus-details, x-icap-error-code, x-icap-error-details. It seems the latter three are not always present, so I'm making them optional:

TIMESTAMP('yyyy-MM-dd HH:mm:ss', tz='GMT'):dateTime ' '

//this captures both date and time fields in TIMESTAMP type, followed by space.
//NB! The date nor time fields contain timezone info, I have to assign it myself.
//GMT is always a good candidate.

LONG:time_taken ' ' //Processing time could potentially be long. Let's use LONG type to capture it.
IPV4:c_ip ' '       //The logs are from 2005, so no IPV6 yet :-)
INT:sc_status ' '   //http status code is INTEGER, we know that by heart
('-' | [A-Z_]+:s_action) ' '
//the s-action field seems to contain uppercase letters and an underscore (Appendix B, Table 7-3)

LONG:sc_bytes ' '
LONG:cs_bytes ' '
LD:cs_method ' '
LD:cs_uri_scheme ' '
LD:cs_host ' '
LD:cs_uri_path ' '
('-' | LD:cs_uri_query) ' '
('-' | LD:cs_username) ' '
LD:s_hierarchy ' '
LD:s_supplier_name ' '
('-' | LD:rs_content_type) ' '
('-' | DQS:cs_useragent) ' '
UPPER:sc_filter_result ' '
LD:sc_filter_category ' '
('-' | DQS:x_virus_id) ' '
IPV4:s_ip ' '
LD:s_sitename
(' ' DQS:x_virus_details)?
(' ' LD:x_icap_error_code)?
(' ' LD:x_icap_error_details)?
[\n\r]+;

Ok, all the 26 fields are now described. But the header lines remain unmatched. This could be quite annoying if I wanted to verify this pattern against all the logs (the Pattern Developer only takes the first 16 Kb of sample data). I want the _unmatched field to contain only data related to records. So I'm adding a pattern for the header rows, too.

// The header and data records need to be defined separately:

$hdr = //header record:
'#' //header always begins with # symbol
LD:hdrLine //capture the rest of line with the LD wildcard
[\n\r]+;

$record =
TIMESTAMP('yyyy-MM-dd HH:mm:ss', tz='GMT'):dateTime ' '

//the data record:
//captures both date and time fields in TIMESTAMP type, followed by space.
//NB! as date and time fields don't contain timezone info, we have to assign it
//ourselves. GMT is always a good candidate.

LONG:time_taken ' ' //processing the time could take a while. Let's use the LONG type to capture it.
IPV4:c_ip ' '       //it's 2005, so no IPV6 yet :-)
INT:sc_status ' '   //http status code is INTEGER, we know that by heart
('-' | [A-Z_]+:s_action) ' '
//the s-action field seems to be uppercase letters and underscore (Appendix B, Table 7-3)

LONG:sc_bytes ' '
LONG:cs_bytes ' '
LD:cs_method ' '
LD:cs_uri_scheme ' '
LD:cs_host ' '
LD:cs_uri_path ' '
('-' | LD{1,20000}:cs_uri_query) ' '
('-' | LD:cs_username) ' '
LD:s_hierarchy ' '
LD:s_supplier_name ' '
('-' | LD:rs_content_type) ' '
('-' | DQS:cs_useragent) ' '
UPPER:sc_filter_result ' '
LD:sc_filter_category ' '
('-' | DQS:x_virus_id) ' '
IPV4:s_ip ' '
LD:s_sitename
(' ' DQS:x_virus_details)?
(' ' LD:x_icap_error_code)?
(' ' LD:x_icap_error_details)?
[\n\r]+;

//the pattern can consist of either header OR data records:
($hdr | $record)
Right, no _unmatched data in the sample any more and let's verify how SpectX handles all the data we have. Pressing Prepare Query
takes us to Query Editor with the prepared skeleton query:

$patt = $[/shared/patterns/bluecoat-w3c-elff.sxp];

//PARSE cmd applies our pattern to file(s) pointed to by src uri:
@stream = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_001.log', pattern:$patt);

//use cmdline syntax in writing query:
@stream
.select(_unmatched, *) //select the _unmatched system field and all pattern-defined fields

Executing the query we get ca 22kB unmatched bytes:



To examine what is still unmatched, I'll just add a simple filter-statement that outputs only the records where the _unmatched field actually has some value. Also, I'm changing the file name in the uri so that all the logs get involved:

$patt = $[/shared/patterns/bluecoat-w3c-elff.sxp];

//PARSE cmd applies our pattern to the file(s) pointed to by the src uri
//by replacing 001 with the wildcard in the filename I have all 4 log files included in parsing
@stream = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_*.log', pattern:$patt);

//use cmdline syntax in writing query:
@stream
.select(_unmatched, *) //select the _unmatched system field and all pattern-defined fields
.filter(_unmatched is not null) //I'm interested only in parsing errors

Executing this I still get 4 records with unmatched data. A double-click on one of them opens a detailed view, which immediately reveals the reason for the parsing error: the uri query part seems to be very long, which obviously does not fit into the cs_uri_query default maximum length of 4096 bytes.


Clearly, I should have anticipated this, particularly when I expect to encounter some bad guys in my further exploration. Back to the pattern to correct it:

('-' | LD{1,20000}:cs_uri_query) ' ' //added non-default max length of 20 Kb.
Save and trying the verifying query again. The result: zero unmatched bytes and 0 rows in the resultset (i.e. no unmatched bytes). Success!

The pattern above took less than an hour to develop. This included searching-reading documentation and drinking coffee.
Would you like to try this with Python and regex? ;-)

Back to articles