use spark to calculate salary statistics for georgia educators (the fourth book of the trilogy)

Blog post by Lester Martin

As the title suggests, this posting was something I came up with AFTER I published the first three installments of my Open Georgia Analysis way back in 2014. And yes, you might have also noticed I took a long break from blogging about Big Data technologies in 2018, and I’m hoping to change that for 2019. On the other hand, my personal blog had a LOT of fun entries thanks to a TON of international travel in 2018.

Dataset & Use Case

To keep the focus on the Simple Open Georgia Use Case, I decided to clean up the dataset first. I ran the cleanup code from use pig to calculate salary statistics for georgia educators (second of a three-part series) to prepare a transformed version of the file, which you can find at the bottom of Preparing Open Georgia Test Data. This makes the Spark code much simpler and more germane to the analysis code I want to present.

On an HDP 3.1.0 cluster I have available, I loaded this file into HDFS and verified it looks good, as shown below.

[dev1@ip-172-30-10-1 ~]$ hdfs dfs -ls
Found 1 items
-rw-r--r--   3 dev1 hdfs    7133542 2019-03-09 22:33 cleanSalaryTravelReport.tsv
[dev1@ip-172-30-10-1 ~]$ hdfs dfs -tail cleanSalaryTravelReport.tsv
ZUCKER,STACEY E	PARAPROFESSIONAL/TEACHER AIDE	23387.87	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZURAS,LINDA D	SPECIAL ED PARAPRO/AIDE	29046.0	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZVONAR,JESSICA L	GIFTED	41672.9	44.37	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZWEIGEL,RENEE E	SCHOOL SECRETARY/CLERK	42681.23	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
[dev1@ip-172-30-10-1 ~]$ 

RDD Implementation

The cluster I am using has Spark 2.3.2 available, as shown when firing up the pyspark shell I will be using for my Resilient Distributed Dataset (RDD) API example.

[dev1@ip-172-30-10-1 ~]$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.0.0-78
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> 

NOTE: The RDD Programming Guide is a great reference.

Load up the HDFS file and verify it looks good.

>>> inputRDD = sc.textFile("/user/dev1/cleanSalaryTravelReport.tsv")
>>> inputRDD.take(2)
[u'ABBOTT,DEEDEE W\tGRADES 9-12 TEACHER\t52122.1\t0.0\tLBOE\tATLANTA INDEPENDENT SCHOOL SYSTEM\t2010', u'ABBOTT,RYAN V\tGRADE 4 TEACHER\t56567.24\t0.0\tLBOE\tATLANTA INDEPENDENT SCHOOL SYSTEM\t2010']
>>> inputRDD.count()
76943

Tokenize the tab-separated string to create an array.

>>> arrayRDD = inputRDD.map(lambda val: val.split("\t"))
>>> arrayRDD.take(2)
[[u'ABBOTT,DEEDEE W', u'GRADES 9-12 TEACHER', u'52122.1', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010'], [u'ABBOTT,RYAN V', u'GRADE 4 TEACHER', u'56567.24', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010']]
>>> arrayRDD.count()
76943

Trim down to just the Local Boards of Education.

>>> justLBOE = arrayRDD.filter(lambda empRec: empRec[4] == 'LBOE')
>>> justLBOE.take(2)
[[u'ABBOTT,DEEDEE W', u'GRADES 9-12 TEACHER', u'52122.1', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010'], [u'ABBOTT,RYAN V', u'GRADE 4 TEACHER', u'56567.24', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010']]
>>> justLBOE.count()
75029

We just want the 2010 records.

>>> just2010 = justLBOE.filter(lambda empRec: empRec[6] == '2010')
>>> just2010.take(2)
[[u'ABBOTT,DEEDEE W', u'GRADES 9-12 TEACHER', u'52122.1', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010'], [u'ABBOTT,RYAN V', u'GRADE 4 TEACHER', u'56567.24', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010']]
>>> just2010.count()
44986
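
As a quick aside, the two filters above could also be collapsed into a single pass. This is just a sketch of that alternative (the rest of the walkthrough sticks with the chained just2010 RDD), and it lands on the same 44986 records.

>>> just2010 = arrayRDD.filter(lambda empRec: empRec[4] == 'LBOE' and empRec[6] == '2010')
>>> just2010.count()
44986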

Number of Employees by Title

To figure out how many people are in each job title, we need to treat this a bit like the canonical WordCount problem. We create key-value pairs (KVPs) with the titles as the keys and hard-code a 1 as the value.

>>> titleKVPs = just2010.map(lambda eR: (eR[1],1))
>>> titleKVPs.take(2)
[(u'GRADES 9-12 TEACHER', 1), (u'GRADE 4 TEACHER', 1)]

Then we can reduce by key, summing up the counts to produce the total for each title.

>>> titleTotals = titleKVPs.reduceByKey(lambda a,b: a+b)
>>> titleTotals.take(2)
[(u'SPEECH-LANGUAGE PATHOLOGIST', 1268), (u'RVI TEACHER', 58)]
>>> titleTotals.count()
181

We can also chain these commands together, add a sort at the end, and show all the results.
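
Here is a minimal sketch of that chained version; I'm assuming sortByKey for an alphabetical listing and a simple collect-and-print loop at the end, so adjust the presentation to taste.

>>> titleTotals = inputRDD \
...       .map(lambda val: val.split("\t")) \
...       .filter(lambda empRec: empRec[4] == 'LBOE' and empRec[6] == '2010') \
...       .map(lambda empRec: (empRec[1], 1)) \
...       .reduceByKey(lambda a,b: a+b) \
...       .sortByKey()
>>> for (title, total) in titleTotals.collect():
...     print("%s\t%d" % (title, total))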

Salary Statistics by Title

Now for the salary statistics themselves. We start by bucketing the records up by their job titles, pairing each title with its salary.

>>> byTitleAndSalary = just2010.map(lambda empRec: (empRec[1], empRec[2]))
>>> byTitleAndSalary.take(2)
[(u'GRADES 9-12 TEACHER', u'52122.1'), (u'GRADE 4 TEACHER', u'56567.24')]
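
As a sketch of where this is heading (and assuming the same count/min/max/average breakdown as the earlier installments), aggregateByKey can carry a (count, min, max, sum) accumulator through a single pass and then derive the average at the end. Note the float conversion, since the salary values came in as strings.

>>> titleSalaryPairs = just2010.map(lambda empRec: (empRec[1], float(empRec[2])))
>>> # accumulator layout: (count, min, max, sum)
>>> zeroValue = (0, float('inf'), float('-inf'), 0.0)
>>> def mergeVal(acc, sal):
...     return (acc[0] + 1, min(acc[1], sal), max(acc[2], sal), acc[3] + sal)
...
>>> def mergeAccs(a, b):
...     return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]), a[3] + b[3])
...
>>> titleStats = titleSalaryPairs.aggregateByKey(zeroValue, mergeVal, mergeAccs)
>>> # swap the running sum for an average: (count, min, max, avg)
>>> titleStats.mapValues(lambda s: (s[0], s[1], s[2], s[3] / s[0])).take(2)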

WORK IN PROGRESS
