Saturday, May 23, 2015

Data Mining and Data Scientist Salary Estimates in the Philippines


Motivation

As I am going back to the Philippines to pursue further studies in Statistics, I am curious whether Data Mining and Data Science are catching on there. I am seeing some positions on job-search websites such as Jobstreet, so, as a data miner, I extracted the job openings related to the key phrases:
  • Data Mining; and
  • Data Scientist.
These may look too specific, but this is just a quick draft anyway. I did not include Data Analyst because it covers a broader range of jobs than the two mentioned. No intensive text extraction using basic Information Retrieval methods is used either.

Warning: The results of the models should not be used to provide recommendations, as the data is collected using a convenience sample and no accuracy tests are performed on held-out data; the only validation is the k-fold cross-validation against the training set when CART is used.

Data Set

The data is collected manually by searching for relevant job openings active today, 22 May 2015. I expect the data set to be relatively small, as fewer than 30 positions are returned. Pre-processing is done externally, in Excel, to remove the currency prefix (i.e. PHP), the text in the experience field, etc.
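The same clean-up could also be done in R instead of Excel. This is only a minimal sketch: the raw strings below are hypothetical examples of what an export might look like, not values from the actual data set, and `clean_number` is a helper name made up for illustration.

```r
# Hypothetical raw values (assumed format, not taken from the real data)
raw_salary <- c("PHP 25,000", "PHP 130,000", "PHP 11000")
raw_experience <- c("4 years", "20 years", "0 years")

# Keep only the digits, then convert to numeric.
# Note: this would also drop a decimal point, so it only suits whole numbers.
clean_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

salary <- clean_number(raw_salary)
experience <- clean_number(raw_experience)
print(data.frame(Expected.Salary = salary, Experience = experience))
```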

library(RCurl)
jobstreet <- getURL("https://raw.githubusercontent.com/foxyreign/Adhocs/master/Jobstreet.csv", ssl.verifypeer=0L, followlocation=1L) # Download from github
writeLines(jobstreet, "Jobstreet.csv")
df <- read.csv("Jobstreet.csv", header = TRUE, sep = ",") # Load dataset
df <- na.omit(df) # Exclude missing data

summary(df) # Summarize
##  Expected.Salary    Experience       Education    
##  Min.   : 11000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.: 19000   1st Qu.: 2.000   1st Qu.:2.000  
##  Median : 28000   Median : 4.000   Median :2.000  
##  Mean   : 35016   Mean   : 4.919   Mean   :2.179  
##  3rd Qu.: 40000   3rd Qu.: 8.000   3rd Qu.:2.000  
##  Max.   :130000   Max.   :20.000   Max.   :4.000  
##                                                   
##                  Specialization           Position 
##  IT-Software            :30     Data Mining   :72  
##  -                      :17     Data Scientist:51  
##  IT-Network/Sys/DB Admin:14                        
##  Actuarial/Statistics   :12                        
##  Banking/Financial      : 8                        
##  Electronics            : 8                        
##  (Other)                :34
As the summary shows, there are only about 120 applicants (123 records after excluding missing data) who applied for these two grouped positions. Since the data does not mention whether an applicant applied for more than one position, I assume that these are distinct records of applicants per position and/or position group, Data Mining and Data Scientist.

Variables

  1. Expected.Salary - numerical. The expected salary of each applicant based on their profile.
  2. Experience - ordinal but treated as numerical for easier interpretation in the algorithms used later. This is the years of work experience of the applicant.
  3. Education - categorical; not used in the models because of extreme imbalance in proportions. This is labelled as:
    • 1 - Secondary School
    • 2 - Bachelor Degree
    • 3 - Post Graduate Diploma
    • 4 - Professional Degree
  4. Specialization - categorical; not used in this analysis.
  5. Position - categorical. Data Mining or Data Scientist
  6. Experience.Group - categorical. Additional variable to bin the years of experience.
# Categorize education variable
df$Education <- factor(df$Education, levels = c(1,2,3,4), 
                       labels=(c("Secondary Sch", "Bach Degree", 
                                 "Post Grad Dip", "Prof Degree")))

# Bin years of experience
df$Experience.Group <- ifelse(df$Experience < 3, "3 Years", 
                              ifelse(df$Experience < 5, "5 Years",
                                     ifelse(df$Experience < 10, "10 Years", "+10 Years")))
df$Experience.Group <- factor(df$Experience.Group, 
                              levels=c("3 Years", "5 Years", "10 Years", "+10 Years"))

# Drop variables
df <- df[, !(colnames(df) %in% c("Education","Specialization"))]

# Subsets positions
mining <- subset(df, Position == "Data Mining")
scientist <- subset(df, Position == "Data Scientist")

Distribution

As expected, Data Scientists have a higher expected salary. The salaries are widely dispersed, yet a Welch t-test, which does not assume equal variances, still shows a significant difference between the average expected salaries of the two positions.
require(ggplot2)
require(scales)

# Boxplot
ggplot(df, aes(x=factor(0), y=Expected.Salary, fill=Experience.Group)) + 
  facet_wrap(~Position) + geom_boxplot() + xlab(NULL) + 
  scale_y_continuous(labels = comma) + 
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.position="bottom")
Figure: Distribution of Expected Salaries (boxplot)
# T-test
t.test(Expected.Salary ~ Position, paired = FALSE, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  Expected.Salary by Position
## t = -3.3801, df = 68.611, p-value = 0.001199
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24086.501  -6205.983
## sample estimates:
##    mean in group Data Mining mean in group Data Scientist 
##                     28736.11                     43882.35
# Median expected salaries of Data Mining vs Data Scientist
c(median(mining$Expected.Salary), median(scientist$Expected.Salary))
## [1] 25000 30000
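To make the Welch test less of a black box, here is a minimal sketch that computes the statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The two vectors are toy values invented for illustration, not the Jobstreet data, and `welch_t` is a hypothetical helper name.

```r
# Welch's t-test by hand: unequal variances, Satterthwaite degrees of freedom
welch_t <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  c(t = t_stat, df = df, p.value = p)
}

# Toy values (assumption, for illustration only)
a <- c(20000, 25000, 22000, 30000, 28000)
b <- c(30000, 60000, 35000, 90000, 45000, 40000)

manual <- welch_t(a, b)
builtin <- t.test(a, b)  # var.equal = FALSE is the default, i.e. Welch
print(manual)
print(builtin)
```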
Come on, fellow data enthusiasts, you should do better than this! The difference between the medians is just 5,000 PHP. In my honest opinion, these central values are far too low given the projected shortage, over the next 10 years, of people who can understand data.

Regression

The intercept is dropped from the model because I want to see the contrast between Data Mining and Data Scientist directly, although I already computed it beforehand. Although the linear regression model shows significant values, the diagnostics indicate that a linear approach is not appropriate: the residuals are not random and fan out in a funnel shape. The quantitative cutoffs I use for significance are: $R_{adj}^{2}>0.80, p<0.05$. Note that for a model without an intercept, R computes $R^{2}$ relative to zero rather than to the mean, so the value reported below is inflated compared with that of an intercept model.

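The regression output coefficients are interpreted as follows, where $x_{exp}$ is the years of experience and $x_{DM}$ and $x_{DS}$ are 0/1 indicators for the Data Mining and Data Scientist positions:
$$\hat{y} = 3,336.3\,x_{exp} + 12,934.9\,x_{DM} + 26,612.0\,x_{DS}$$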
# Estimate coefficients of linear regression model
summary(lm(Expected.Salary ~ Experience + Position-1, data=df))
## Call:
## lm(formula = Expected.Salary ~ Experience + Position - 1, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45312 -11123  -1280   6877  66688 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## Experience               3336.3      446.3   7.476 1.38e-11 ***
## PositionData Mining     12934.9     3024.3   4.277 3.83e-05 ***
## PositionData Scientist  26612.0     3455.8   7.701 4.27e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18350 on 120 degrees of freedom
## Multiple R-squared:  0.8136, Adjusted R-squared:  0.809 
## F-statistic: 174.6 on 3 and 120 DF,  p-value: < 2.2e-16
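As a quick worked example, the fitted coefficients printed above can be plugged in by hand. This is just a sketch: `coefs` copies the estimates from the output, `predict_salary` is a hypothetical helper, and five years of experience is an arbitrary choice.

```r
# Estimates copied from the regression output above
coefs <- c(Experience = 3336.3,
           "Data Mining" = 12934.9,
           "Data Scientist" = 26612.0)

# Predicted expected salary: slope * years + position-specific level
predict_salary <- function(experience, position) {
  unname(coefs["Experience"] * experience + coefs[position])
}

predict_salary(5, "Data Mining")     # 3336.3 * 5 + 12934.9 = 29616.4
predict_salary(5, "Data Scientist")  # 3336.3 * 5 + 26612.0 = 43293.5
```

Since the funnel-shaped residuals suggest the linear fit is poor, these numbers are better read as an illustration of the formula than as reliable estimates.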
# Scatter plot
ggplot(df, aes(x=Experience, y=Expected.Salary)) + 
  geom_point(aes(col=Experience.Group)) + 
  facet_wrap(~Position) + 
  scale_y_continuous(labels = comma) + 
  stat_smooth(method="lm", fullrange = T) + 
  theme(legend.position="bottom")
Figure: Scatterplot of Expected Salary per Year of Experience
# Diagnose LM
par(mfrow=c(1,2))
plot(lm(Expected.Salary ~ Experience + Position-1, data=df), c(1,2))
Linear Regression Estimates and Diagnostics



CART

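A decision tree is used because linear regression does not do well with this data set. The call below requests information gain as the splitting criterion, but since Expected.Salary is numeric, rpart fits a regression tree (the anova method), which ignores that setting and instead chooses each split to maximize the reduction in the sum of squared errors. Of course, years of experience is more influential than the position.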
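For a numeric response such as Expected.Salary, rpart builds a regression tree, and as far as I understand it scores candidate splits by how much they reduce the sum of squared errors. The sketch below illustrates that idea on a single numeric predictor. It is not rpart's actual implementation; `sse` and `best_split` are hypothetical helpers and the data are toy values.

```r
# Sum of squared errors around the group mean
sse <- function(v) sum((v - mean(v))^2)

# Try every midpoint between consecutive distinct x values and
# keep the cut that reduces the total SSE the most
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  gain <- sapply(cuts, function(cut) {
    sse(y) - sse(y[x < cut]) - sse(y[x >= cut])
  })
  c(cut = cuts[which.max(gain)], gain = max(gain))
}

# Toy values (assumption, for illustration only)
exp_years <- c(0, 1, 2, 3, 6, 8, 10)
salary <- c(15000, 17000, 20000, 22000, 50000, 55000, 60000)

print(best_split(exp_years, salary))
```

On these toy values the chosen cut falls between 3 and 6 years, where the salaries jump.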

Looking at the estimated salaries from the printed tree, applicants with less than 1.5 years of experience expect approximately 17,000 PHP, while those who applied for Data Scientist jobs with 6.5 or more years of experience expect about 66,000 PHP on average.

require(rpart)
require(rattle)

cart <- rpart(formula = Expected.Salary ~ Experience + Position, 
              data = df, 
              parms = list(split = "information"), # Requests information gain; only honored for classification trees
              model = T) # Retains model information

# Plot tree
layout(matrix(c(1,2), nrow = 1, ncol = 2, byrow = TRUE), widths=c(1.5,2.5)) # Two panels side by side
barplot(cart$variable.importance, 
        cex.names = 0.6, cex.axis = 0.5,
        sub = "Variable Importance") 
fancyRpartPlot(cart, main=NULL, sub=NULL)
Figure: Decision Tree using CART and Variable Importance
# Estimates
print(cart); printcp(cart)
## n= 123 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 123 66101970000 35016.26  
##    2) Experience< 4.5 64  7326938000 23781.25  
##      4) Experience< 1.5 27   520666700 16888.89 *
##      5) Experience>=1.5 37  4587676000 28810.81  
##       10) Position=Data Mining 24  1055625000 24375.00 *
##       11) Position=Data Scientist 13  2188000000 37000.00 *
##    3) Experience>=4.5 59 41933560000 47203.39  
##      6) Position=Data Mining 33  9501333000 37333.33 *
##      7) Position=Data Scientist 26 25137120000 59730.77  
##       14) Experience< 6.5 8  5237500000 46250.00 *
##       15) Experience>=6.5 18 17799610000 65722.22 *
## 
## Regression tree:
## rpart(formula = Expected.Salary ~ Experience + Position, data = df, 
##     model = T, parms = list(split = "information"))
## 
## Variables actually used in tree construction:
## [1] Experience Position  
## 
## Root node error: 6.6102e+10/123 = 537414370
## 
## n= 123 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.254780      0   1.00000 1.00888 0.19245
## 2 0.110361      1   0.74522 0.81057 0.14709
## 3 0.033563      2   0.63486 0.77285 0.13292
## 4 0.031769      3   0.60130 0.76228 0.13052
## 5 0.020333      4   0.56953 0.74207 0.12969
## 6 0.010000      5   0.54919 0.70258 0.12366
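The printed tree can be read as a set of nested rules. As a sketch, here it is rewritten as a plain function: the thresholds and leaf values are copied from the output above, and `tree_predict` is a hypothetical name.

```r
# Leaf values and thresholds copied from the printed rpart tree
tree_predict <- function(experience, position) {
  if (experience < 4.5) {
    if (experience < 1.5) return(16888.89)            # node 4
    if (position == "Data Mining") return(24375.00)   # node 10
    return(37000.00)                                  # node 11
  }
  if (position == "Data Mining") return(37333.33)     # node 6
  if (experience < 6.5) return(46250.00)              # node 14
  65722.22                                            # node 15
}

tree_predict(1, "Data Mining")     # 16888.89
tree_predict(8, "Data Scientist")  # 65722.22
```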
Again, fellow data miners and data scientists, ask for more! You do not realize your worth given the current demand for people who can understand data.

The full paper on this can be downloaded at https://github.com/foxyreign/Adhocs/blob/master/Jobstreet.pdf


Thursday, May 7, 2015

The Data Science Handbook

If, like me, you are starting a career in the Data Science field, there is no better advice than that of notable people in the same field; of course, experience and proper education are advisable too.


I am currently reading a book that compiles insights from Data Scientists across the board - from Marketing, Astrophysics, Entrepreneurship, Information Technology, Classical Statistics and Machine Learning, etc. This is the first non-fiction book I have picked up apart from the R Programming book that came along with the R Programming Class co-offered by Johns Hopkins University and Coursera as part of the Data Science Specialization.

The book is free, or you can pay whatever amount you consider fair.

Thursday, April 9, 2015

Fitness Goals for 2015


Here I am again, regaining the momentum I lost two years ago when I stopped maintaining a healthier lifestyle. This time, for sure, I will stick to it until I reach the 3rd phase of my target and maintain it. Besides, I will be the loser again if I cannot reach or maintain this.



I used a moving average to smooth the data because I feel that a linear trend in a time-series model is not appropriate given that I have very few data points; maybe after three months of tracking, I will include that.
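For reference, a trailing moving average can be computed in base R with `stats::filter`. This is a minimal sketch: the measurements are hypothetical, and the 3-point window is an assumption, not necessarily the window I used.

```r
# Hypothetical measurements (assumption, for illustration only)
weight <- c(80, 79, 81, 78, 77)

# Trailing 3-point moving average: each value is the mean of the
# current and the two previous observations (sides = 1)
k <- 3
ma <- stats::filter(weight, rep(1 / k, k), sides = 1)

print(as.numeric(ma))  # the first k - 1 values are NA
```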

What really concerns me is that I see a statistically significant positive correlation between the increase in my fat and my fat-free muscle mass. In due time, this has to change: the correlation should become flat or, better, negative. I'll definitely look very muscular if that happens.