Detecting Human Behaviour on BlueCoat Proxy Logs

By Jüri Shamov-Liiver on 24 May 2018

This is the sequel to exploring BlueCoat proxy logs. Having defined their structure, I now proceed to analyse the actual log content. As my knowledge of the origin of these sample logs is limited, I'll try to extract as much information as I can about the nature of the proxy deployment, user behaviour and the users' geographical location.
First, let's create a view of selected fields of interest (skipping the irrelevant ones).

$pattern = $[/shared/patterns/bluecoat-w3c-elff.sxp];
@src = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_*.log', pattern:$pattern);

@src
.filter(hdrLine is null)
.select(                     //select only the fields of interest
        dateTime
        ,c_ip                 //client ip address
        ,sc_status           //Protocol status code from appliance to client
        ,cs_method           //Request method used from client to appliance
        ,cs_uri_scheme       //Scheme from the 'log' URL
        ,cs_host             //Hostname from the 'log' URL.
        ,cs_uri_path         //Path from the 'log' URL. Does not include query.
        ,cs_uri_query         //Query from the 'log' URL.
        ,cs_username         //Relative username of a client authenticated to the proxy
        ,s_supplier_name     //Hostname of the upstream host (unavailable for a cache hit)
        ,rs_content_type     //Response header: Content-Type
        ,cs_useragent         //Request header: User-Agent
        ,sc_filter_result     //Content filtering result: Denied, Proxied or Observed
        ,sc_filter_category   //Content filtering category of the request URL
        ,x_virus_id           //Identifier of a virus if one was detected
        ,x_virus_details)    //Details of a virus if one was detected
.save('/shared/demo/bluecoat/bcdata.sxt');
Reading raw data from the local hard disk, then extracting and transforming (parsing) all 4 log files (2.6 GB in total) takes about 10 seconds on my MacBook Pro with 8 CPU cores and 16 GB RAM. Not too bad, but I can go even faster by saving the parsed results and running further queries on those. On saved results, even the most complex queries take only a few seconds. Note that if the dataset had included live updated data, I couldn't have taken this approach.

I have very little knowledge about the environment the logs come from. I only know that they come from a lab (the Honeynet Project) and that they contain both production (i.e. live) and simulated (artificially generated) traffic. However, I have no idea how this BlueCoat proxy was deployed, what sort of clients are behind it, etc. Hence, let's see what I can deduce from the logs themselves.

Sidenote: I'm mostly using the piped command style in the query examples. All of them could also be written in SQL style.

Traffic Dynamics, Origin and Destination Distributions

One of the first things I always check is what the traffic dynamics look like:
@src = @[/shared/demo/bluecoat/bcdata.sxt]; //declare variable @src referring to stored 
                                            //parsed data

@hourlyReq = @src                         //declare query using variable @hourlyReq
                                            //data from @src gets piped
                                            //to the next statement:
.select(
  dateTime[1 hour] as hourTrunc,         //truncate the timestamp to hour-precision,
                                            //assign alias hourTrunc
  count(*) as requests)                   //calculate counts, assign alias requests
.group(hourTrunc)                         //compute counts by hourTrunc groups
;

@hourlyReq                                 //execute query

/*the same query written in SQL style produces the exact same result:
@hourlyRequests =
SELECT
  dateTime[1 hour] as hourTrunc,          //truncate timestamp to hour precision,
                                          //assign alias hourTrunc
  count(*) as requests                    //calculate counts, assign alias requests
FROM @src
GROUP BY hourTrunc                        //compute counts by hourTrunc groups
;*/
Requests per hour
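For readers following along without SpectX, the truncate-to-hour-and-count aggregation above can be sketched in plain Python (the timestamps below are made-up stand-ins for the parsed dateTime field):

```python
from collections import Counter
from datetime import datetime

# Sample request timestamps (stand-ins for the parsed dateTime field).
timestamps = [
    datetime(2005, 4, 28, 17, 34, 12),
    datetime(2005, 4, 28, 17, 59, 1),
    datetime(2005, 4, 28, 18, 5, 47),
]

# Truncate each timestamp to hour precision -- the dateTime[1 hour] step --
# and count requests per hour bucket, as group(hourTrunc) + count(*) does.
hourly = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)

for hour, requests in sorted(hourly.items()):
    print(hour, requests)
```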

There's approximately a month of records (from April 4th to May 7th, 2005). I can observe three distinct periods: the first ~10 days (5th to 14th April) with very little traffic (a few dozen hourly requests), then a period of complete silence from 13th to 27th April, and finally the last 10 days with rapidly increasing traffic.

Hmm ... It is very unlikely that such a traffic increase is caused by some malicious activity, unless it's a DDoS. The latter should show convergence to very few targets, so it's easy to verify. More likely, my hypothesis is that the lab guys tested the initial setup during the first 10 days, then went on to improve it (or party?), and after 2 weeks they launched the real thing. OK, let's see what the distribution of target sites looks like:
@hourlyTargets = @src
.select(
  hourTrunc:dateTime[1 hour], //truncate timestamp to hour precision,
                                 //assign alias hourTrunc
  cs_host,
  count(*))                   //count rows per unique hourTrunc and cs_host pair
.group(@1, @2)                 //group by first and second position
                                 //fields (hourTrunc and cs_host)
.select(hourTrunc, targets:count(*))   /*compute the count of hosts
                                         per hourTrunc*/
.group(@1)
.join(@hourlyReq on left.hourTrunc = right.hourTrunc)
.select(tHour:left.hourTrunc, targets, requests);
@hourlyTargets;
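The join back to @hourlyReq on the hour key is an ordinary inner join of two per-hour series; a minimal Python sketch with made-up hour keys and counts:

```python
# Join two per-hour series on their hour key, mirroring the
# .join(@hourlyReq on left.hourTrunc = right.hourTrunc) step.
# Hour keys are plain strings and counts are made-up sample values.
hourly_targets = {"2005-05-05 13:00": 412, "2005-05-05 14:00": 388}
hourly_requests = {"2005-05-05 13:00": 25710, "2005-05-05 14:00": 24102}

joined = [
    (hour, targets, hourly_requests[hour])
    for hour, targets in hourly_targets.items()
    if hour in hourly_requests            # inner join: keep matching hours only
]
print(joined)
```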

As we can see, the number of target hosts clearly follows the same pattern as the number of requests during the "live" period. To be sure, we can also compute the Pearson correlation coefficient between the two:
@hourlyTargets
.select(CORR(targets, requests));   //compute Pearson correlation coefficient
which gives us 0.98 - i.e. we see a strong correlation between the number of requests and the number of distinct hosts. We can now safely rule out the DDoS hypothesis.
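CORR here is the standard Pearson correlation coefficient; a stdlib-only Python equivalent (the input series are illustrative, not taken from the logs):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, as computed by an aggregate like CORR."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Perfectly linearly related series correlate at exactly 1.0.
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))   # 1.0
```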

Next, let's have a look at the clients. Their distribution should give us some clue as to how the proxy is set up.
@hourlyClients = @src
.select(
  hourTrunc:dateTime[1 hour], //truncate timestamp to hour-precision, assign alias hourTrunc
  c_ip,
  count(*))                   //count rows per unique hourTrunc and c_ip pair
.group(@1, @2)
.select(hourTrunc, clients:count(*)) //count unique clients per hourTrunc
.group(@1)
.join(@hourlyReq on left.hourTrunc = right.hourTrunc)
.select(tHour:left.hourTrunc, clients, requests);

@hourlyClients;
Here, the upper diagram displays the first period (the area greyed out on the lower diagram), while the lower diagram shows the trend for the entire period:
We can see a constant chatter of a few dozen requests per hour from a few clients, mixed with occasional bursts of requests from the same low number of clients. Considering that the number of targets remains low as well, perhaps this is simulated traffic. Since the overall number of requests and targets is low, I'll look at the requests from this period:
@testPeriodTraffic = @src
.filter(dateTime > T(2005,04,08) and dateTime < T(2005,04,12))
.select(cs_host, cs_uri_path, count(*))
.group(@1,@2)
.sort(@3 desc)
.limit(10);

@testPeriodTraffic;
cs_host                     cs_uri_path                                                      count_star
www.ositis.com              /UpdateLocation/ProxyAV/McafeeAntivirusEngineLocation            361
www.ositis.com              /UpdateLocation/ProxyAV/McafeeAntivirusPatternLocation           350
www.ositis.com              /tests/testconnectivity.asp                                      171
10.2.13.200                 /index.html                                                      117
10.2.13.200                 /favicon.ico                                                     95
register.ositis.com         /cgi/RegCGI.cgi                                                  68
download.windowsupdate.com  /v5a/windowsupdate/b/SelfUpdate/AU/x86/NetServer/en/wusetup.cab  37
10.2.13.200                 /                                                                33
download.windowsupdate.com  /ws03sp1/windowsupdate/redir/wuredir.cab                         23
download.bluecoat.com       /release/ProxyAV/2.2.1/2000E/update-22.asp                       22

This looks very much like setup- and testing-related activity: updating the McAfee AV engine and virus patterns, downloading and applying operating system updates, setting up the BlueCoat appliance, etc.

The number of clients during the "live" period follows the pattern of the requests and targets distributions (requests are not shown, for better visibility):
Let's compute the correlation to be sure:
@clientRequestCorrelation = @hourlyClients
.select(CORR(clients, requests))   //compute Pearson correlation coefficient
;

@clientRequestCorrelation;
We get 0.98 again. Considering the fairly high number of both clients and target hosts, this looks like real live traffic. This is also suggested by the period of the waves on the graph: in the GMT timezone, traffic starts increasing at 1 pm and dies off at 3 am.

IP Network Analysis

Next, let's look at how the clients' IP countries split between daytime and nighttime traffic:
@hourlyCC = @src
.select(ccIP:cc(c_ip),
    dayTime:count(HOUR(dateTime)>8 and HOUR(dateTime)<16),
    nightTime:count(HOUR(dateTime)<=8 or HOUR(dateTime)>=16))
.group(@1);

@hourlyCC;
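The conditional count aggregates above (count over a predicate on HOUR(dateTime)) reduce to summing a boolean over the request hours; a small Python sketch with made-up hours:

```python
# Conditional counting, like count(HOUR(dateTime) > 8 and HOUR(dateTime) < 16):
# sum a boolean predicate over the request hours. The hours below are made up.
request_hours = [2, 7, 9, 11, 14, 15, 18, 23]

day_time = sum(1 for h in request_hours if 8 < h < 16)
night_time = sum(1 for h in request_hours if h <= 8 or h >= 16)

print(day_time, night_time)   # every request falls in exactly one bucket
```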
The biggest contributor is an unknown country. Internal address space? Let's see the traffic by class A (/8) networks:
@top5Aclass = @src
.select(c_ip & 0xFF000000l as network, requests:count(*))   //compute count by /8 network
.group(@1)
.sort(@2 desc)
.select(*, percentage:(DOUBLE(requests)/$totalRequests)*100) //compute the percentage of total
.limit(5);

@top5Aclass;

network      requests   percentage
45.0.0.0     7999532    98.3881
192.0.0.0    27032      0.3325
81.0.0.0     19310      0.2375
10.0.0.0     19181      0.2359
213.0.0.0    15171      0.1866
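The c_ip & 0xFF000000l expression keeps only the first octet of the address; the same /8 bucketing can be sketched with Python's ipaddress module (the example addresses are made up):

```python
import ipaddress

def class_a_network(ip: str) -> str:
    """Mask an IPv4 address down to its /8 network, like c_ip & 0xFF000000l."""
    masked = int(ipaddress.IPv4Address(ip)) & 0xFF000000
    return str(ipaddress.IPv4Address(masked))

print(class_a_network("45.32.17.9"))    # 45.0.0.0
print(class_a_network("10.2.13.200"))   # 10.0.0.0
```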

It's weird that 98.4% of the traffic originates from the 45.0.0.0/8 range, which in 2005 belonged to the Interop Show Network - an annual information technology trade fair organized by UBM (https://en.wikipedia.org/wiki/Interop). Coincidentally, the NetWorld(SM)+Interop(R) 2005 conference was held May 1-6, 2005 at the Mandalay Bay Convention Center in Las Vegas (https://www.businesswire.com/news/home/20041207005103/en/NetworldInterop-2005-Showcase-Largest-Assembly-Exhibitors-North). Did the Honeynet guys set up their proxy to intercept conference traffic?

The requests graph @hourlyReq was computed with the timezone set to GMT, which shows the activity increase at nighttime. When changing the timezone to PST (which is where Las Vegas happens to be), we observe activity starting to increase at 6-7 am and coming down to a nightly level by 7 pm. There is also a distinct "saddle" around lunchtime. Looks like perfectly normal human behaviour to me.
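The timezone change itself is a fixed-offset conversion; note that in early May, Las Vegas is actually on Pacific Daylight Time (UTC-7), which the text calls "PST" loosely. A Python sketch:

```python
from datetime import datetime, timedelta, timezone

# In early May 2005, Las Vegas was on Pacific Daylight Time (UTC-7);
# the article refers to this as "PST" loosely.
pacific = timezone(timedelta(hours=-7))

gmt_bucket = datetime(2005, 5, 5, 13, 0, tzinfo=timezone.utc)
local_bucket = gmt_bucket.astimezone(pacific)

print(local_bucket.hour)   # 13:00 GMT is 6 am locally: the morning ramp-up
```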

Content Analysis

What does BlueCoat content filtering say? If this was a computer and information technology related event, we should see Computers/Internet at the top of the content filtering categories of the request URLs:
@contentFilter = @src
.select(sc_filter_category, count(*)).group(@1).sort(@2 desc).limit(15);

@contentFilter;

sc_filter_category          count_star
Computers/Internet          1445157
Web%20Advertisements        1429511
none                        969769
Business/Economy            863626
News/Media                  680855
Search%20Engines/Portals    381611
Email                       259397
Travel                      257823
Shopping                    252155
Sports/Recreation/Hobbies   231239
Arts/Entertainment          141181
Brokerage/Trading           138826
Pornography                 118691

Sure enough, we do. As well as Business/Economy, News and Email in prominent positions. Oh, and porn in 13th place. In my opinion, porn is a good indicator of human behaviour - I have never seen a bot or trojan showing any interest in it. Let's have a look at the most popular time of day for enjoying this sort of entertainment:
@hourlyPorn = @src
.select(
  hourTrunc:dateTime[1 hour], //truncate timestamp to hour precision, assign alias hourTrunc
  pornCnt:count(sc_filter_category = 'Pornography'),
  total:count(*))
.group(@1);

@hourlyPorn;

Apparently, there is no porn during the test period (it looks like the Honeynet guys were truly committed to their work in setting up the lab environment). Porn traffic appears only during the "live" period, and during PST daytime. Attendees bored of the conference? Let's see the related traffic by class A networks:
@AClassPornNetworks = @src
.filter(sc_filter_category = 'Pornography')
.select(c_ip & 0xFF000000l as network, requests:count(*))   //compute count by /8 network
.group(@1)
.sort(@2 desc);

@AClassPornNetworks;

network     requests
45.0.0.0    117813
204.0.0.0   878

Yeah, over 99% comes from the conference. Still, we're only human - and this supports the hypothesis of real live traffic.

Moving on, let's see the BlueCoat virus information:
@viruses = @src
.select(x_virus_id,
    testPeriodCnt:count(dateTime < T(2005,04,27)),
    livePeriodCnt:count(dateTime >= T(2005,04,27))
    )
.group(@1).sort(@3 desc)
;

@viruses;

x_virus_id                testPeriodCnt   livePeriodCnt
NULL                      11509           8118891
EICAR test file           11              81
W32/Sober.gen@MM          0               48
Exploit-MhtRedir.gen      0               8
BackDoor-CPT              0               5
Downloader-RK             0               5
Exploit-ByteVerify        0               4
JS/Exploit-MhtRedir.ldr   0               4
StartPage-DU.dll          0               4
Downloader-LG             0               2
Downloader-YH.dr          0               2
W32/Sober.p@MM!zip        0               2
Downloader-VG             0               1
Downloader-YK             0               1
Exploit-DcomRpc.gen       0               1
Exploit-HelpZonePass      0               1
Exploit-ObscuredHtml      0               1
FDoS-UDPFlood             0               1
Generic Downloader.g      0               1
JS/Noclose.gen            0               1
JV/g                      0               1
Prutec                    0               1
VBS/Inor                  0               1
VBS/Psyme                 0               1
Vundo                     0               1
W32/Netsky.z@MM!zip       0               1

25 different viruses were identified. They make up 0.002% of all requests, and half of those are not real virus infections but testing (the EICAR test file). Which sites are distributing this stuff?
@dropperSites = @src
.filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
.select(cs_host, count(*)).group(@1).sort(@2 desc);

@dropperSites;
37 sites. And techlabs.bluecoat.com is among them. How come? They're supposed to be a decent vendor, not a dodgy place to catch viruses.
Let's have a closer look:
@bluecoatTechlabs = @src
.filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
.filter(cs_host = 'techlabs.bluecoat.com')
.select(dateTime, sc_status, cs_uri_path, x_virus_id, x_virus_details);

@bluecoatTechlabs;

dateTime                 sc_status  cs_uri_path           x_virus_id  x_virus_details
2005-04-28 17:34:12,200  200        /eicar/noisybear.cab  JV/g        Virus: JV/g; File: noisybear.cab; Sub File: \NoisyBear.class; Vendor: Network Associates, Inc (McAfee); Engine error code: None; Engine version: 10.21; Pattern version: 4.3.20.124865; Pattern date: 04/28/2005 04:40:00

noisybear.cab is a creation of Mark LaDue - a proof-of-concept malicious Java program (http://www.cigital.com/hostile-applets). It displays a picture of a bear while emitting an annoying noise and is used for testing AV detection, similarly to EICAR. We can safely exclude it from the real dropper sites.

How many clients were infected? With which viruses? When? The following query produces a list of client IPs with real infection attempts. In an enterprise environment, this would serve as a good starting point for CSIRT mitigation actions.
@infectedClients = @src
.filter(x_virus_id is not null AND sc_status = 200)
.filter(x_virus_id != 'EICAR test file')
.filter(cs_host != 'techlabs.bluecoat.com')
.select(
c_ip,
x_virus_id,
min_dateTime:min(dateTime)) //get the earliest client timestamp of infection
.group(@1, @2);

@infectedClients;
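The group + min(dateTime) step reduces to keeping a per-(client, virus) minimum; a Python sketch over made-up rows (ISO-formatted timestamp strings sort chronologically, so string min() suffices):

```python
# Earliest timestamp per (client IP, virus) pair, mirroring the
# group(@1, @2) + min(dateTime) step. Rows are made-up sample tuples.
rows = [
    ("45.12.3.4", "W32/Sober.gen@MM", "2005-05-02 10:15:00"),
    ("45.12.3.4", "W32/Sober.gen@MM", "2005-05-01 09:02:00"),
    ("45.99.8.7", "Downloader-RK",    "2005-05-03 14:30:00"),
]

first_seen = {}
for c_ip, virus, ts in rows:
    key = (c_ip, virus)
    # Keep the smallest (earliest) timestamp seen for each group key.
    first_seen[key] = min(first_seen.get(key, ts), ts)

print(first_seen[("45.12.3.4", "W32/Sober.gen@MM")])   # 2005-05-01 09:02:00
```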

Conclusion

The logs consisted of web traffic records from April 4th to May 7th, 2005. They included three distinct periods, which I identified as:
  1. the test period of initial setup (from 5th to 14th April),
  2. a period of complete silence (from 13th to 27th April), and
  3. the "live" period from April 27th to May 7th.
The testing period is characterized by a relatively low number of requests (ten to a few hundred per hour) to a few selected sites, related to setting up the BlueCoat appliance and the antivirus detection engine, and updating the operating system (Win XP SP1).

The silence period could mean the lab folks spent their time setting up the surrounding environment for the live period.

Finally, the "live" period is characterized by a rapidly increasing amount of traffic, peaking on May 5th at 260,000 requests per hour and then gradually decreasing. The number of requests, originating clients and target hosts is high enough to make artificial generation very complicated and therefore unlikely. The distribution of originating clients and target hosts very closely follows the traffic activity patterns.

The activity patterns match normal daily human activity: a rapid increase from the early morning hours, decreasing in the afternoon, with a distinct lower "saddle" between them indicating a lunch break. The timezone matching this pattern places the originating activity on the US west coast (PST). Another aspect indicating the human nature of the traffic's origin is the presence of visits to adult sites.

Network analysis strongly suggests that the traffic originates from participants of the NetWorld(SM)+Interop(R) conference, held May 1-6 at the Mandalay Bay Convention Center in Las Vegas.

Observed virus infections were relatively rare: only 0.002% of the traffic relates to known viruses, and half of that relates to testing antivirus engines. Out of 3,499 clients, 49 performed real virus downloads (i.e. infection attempts).

You can download the whole script from here: https://docs.spectx.com/_downloads/bcscript.zip. Logs are available for download from here: http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip
