SpectX & Open Data: Parsing and Analysing html-lists

By Jüri Shamov-Liiver on 18 May 2017

I recently came upon this article about an introductory open data exercise using Python (scraping web, extracting data with regex and visualization). Fun learning indeed. Then I started wondering, how much much time and how many lines of code it would take to do the same thing with SpectX? Let’s find out.

So, here’s the web page containing the annual salaries of Cleveland colleges presidents. Our goal is to answer the question: how much tuition does one pay for five minutes of their college president’s time?

The first thing to do is to develop a SpectX pattern for extracting presidents and their salaries. Open Data Browser, paste in the url and press Prepare Pattern. Quickly scrolling through data in Data Editor I can see that the table with salaries is not there, likely because it is not within the first 16kB of the html-page. Well, we don’t need the whole page, all we need is just the list so let’s copy that from the page source (cmd+U in FireFox) to SpectX Data Editor in order to prepare a pattern (the queries will later still run directly on the original cleveland.com webpage). 

<p>Following is total compensation for other presidents at private colleges in Ohio in 2015:
</p>
<ul>
<li>Grant Cornwell, College of Wooster (left in 2015): $911,651</li>
<li>Marvin Krislov, Oberlin College (left in 2016): &nbsp;$829,913</li>
<li>Mark Roosevelt, Antioch College, (left in 2015): $507,672</li>
<li>Laurie Joyner, Wittenberg University (left in 2015): $463,504</li>
<li>Richard Giese, University of Mount Union (left in 2015): $453,800</li>
<li>Sean Decatur,Kenyon College: $451,698</li>
<li>Adam Weinberg, Denison University: $435,322</li>
<li>Daniel Dibiasio, Ohio Northern University: $414,716</li>
<li>Denvy Bowman, Capital University (left in 2016): $388,570</li>
<li>Anne Steele, Muskingum University (left in 2016): $384,233</li>
<li>Kathy Krendl, Otterbein University: &nbsp;$378,035</li>
<li>Rockwell Jones, Ohio Wesleyan University: $366,625</li>
<li>Robert Helmer, Baldwin Wallace University: $365,616</li>
<li>Robert Huntington, Heidelberg University: $300,005</li>
<li>Lori Varlotta, Hiram College: $293,336</li>
<li>Joseph Bruno, Marietta College (left in 2016): $288,295</li>
<li>W. Richard Merriman Jr., University of Mount Union (started in June 2015): $221,761</li>
</ul>
<p>The Rev. Robert Niehoff, president of John
This seems to be the only list in this page, therefore we can simply match list items, i.e the stuff between <li> and </li> tags:
'<li>'                  // the pattern begins with the list item opening tag
LD:president ','       /* LD matches everything until the next matcher in the line
                          (a comma in this case), i.e we get president's name.
                          Assigning the column a name: 'president' */
LD:college             // the college name
('(' LD ')')?         /* I'm putting leaving notes between parentheses.
                           Not all records have this,therefore I'm enclosing it
                           in the sequence group and making it optional.*/
                        /*Since we're not really interested in those leaving notes,
                          let's not include them as a column. */
':'                   //The college name is matched until colon
LD '$'                 //Skip everything until $
CDOUBLE:salary         /*Parse salary as a double value from the integer value
                          that has groups of thousands separated by a comma*/
'</li>' EOL           //list item closing tag followed by the line feed
Cool, the Parse Preview now has all the needed list items matched. The parse results also show that all elements from these items are parsed correctly. I'll now copy the pattern to the clipboard and go to the Data Browser again. It conveniently opens the last uri I navigated to, so I just need to press Prepare Query. Now I'll replace the automatically generated pattern by pasting the one from clipboard:

$pattern = <<<PATTERN   // I'm using the heredoc syntax for declaring the pattern string
'<li>' // the pattern begins with the list item opening tag
LD:president ',' /* LD matches everything until the next matcher in the line
                           (a comma in this case),i.e we get president's name.
                           Export as 'president'*/
LD:college // next, the college name
('(' LD ')')? /* I'm putting leaving notes between parentheses.
                           Not all records have this field,therefore I'm enclosing it
                           in the sequence group and making it optional. Since we're not
                           really interested in those leaving notes,
                           let's not include them as a column. */
':' // the college name is matched until the colon
LD '$' // skip everything until $
CDOUBLE:salary /* parse salary as a double value from the integer value
                           that has groups of thousands separated by a comma */
'</li>' EOL //list item closing tag followed by line feed
PATTERN;

@stream = PARSE(pattern:$pattern, src:'http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html');

@stream
.select(president, college, salary:salary*1000)
                           /* as we parsed salary as a double, we now need to
                           multiply it by 1000 to get the correct value */
.sort(salary DESC) //sort the results by the size of salary in descending order
Executing the query script gives us:

Pressing Chart displays a nice bar chart of salaries:

To compute the value of 5 minutes of a president's time (assuming 2000 working hours per year) we add another select command performing some simple arithmetic (divide annual salary by 2000 to get the hourly value and then divide by 12 to get the 5-min value, followed by rounding and casting to get a clean output):

@stream
.select(president, college, salary:salary*1000) /* having parsed salary as a double,
   we need to multiply it by 1000
   to get the correct value */
.sort(salary DESC) /*sort the results by the amount
   of salary in descending order */
.select(president ||', '|| college as president_college,
'$' || STRING(LONG(ROUND(salary/2000)/12)) as president_5min_value)
The result: 

Here's the full script. Feel free to copy, paste, execute and play around with it in SpectX (grab the 30-day free trial here).

$pattern = <<<PATTERN
'<li>' LD:president ',' LD:college ('(' LD ')')? ':' LD '$' CDOUBLE:salary '</li>' EOL
PATTERN;

PARSE(pattern:$pattern, src:'http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html')
.select(president, college, salary:salary*1000) /*having parsed salary as a double,
                                                      we need to multiply it by 1000
                                                      to get the correct value*/
.sort(salary DESC) /*sort the results by the amount
                                                      of salary in descending order */
.select(president ||', '|| college as president_college,
'$' || STRING(LONG(ROUND(salary/2000)/12)) as president_5min_value)
The time spent on developing this pattern and query was ~10 minutes. The whole script can easily fit under 10 lines when removing comments from the pattern and converting it to one line.

Back to articles