Objectives

In this hypothetical scenario, pretend that you are a student currently learning the R language. Take your time and really get into the role.

We’re going to use public data from the U.S. Department of Labor in order to lightly research various R-related occupations, the values and interests of R-using occupational incumbents, the labor market value of the R language, and we conclude by predicting your future salaries based on your work values!


Data: O*NET & SOC Codes

O*NET, or the Occupational Information Network, is the admirable pursuit by the Department of Labor to consolidate the classification of approximately one thousand kinds of jobs. If that seems like a lot, it comes from the even more voluminous DOT (Dictionary of Occupational Titles), which defined over 13,000 jobs since 1938. To do this, all federal agencies use a unique identifier for each job type, known as a SOC or Standard Occcupational Classification. SOC codes, typically variable O*NET-SOC Code, most often become the unique keys necessary to joining our tables!


The O*NET Resource Center contains all individual tables used to power My Next Move and other platforms. Tables are organized according to myriad dimensions, seen here:



Exploring O*NET Tables

Let’s look at a few tables to get to know the data.


Occupation Titles

Read in one of the highest-level tables, Occupation Data, which contains:

  1. Occupation SOC Code
  2. Occupation Title
  3. Occupation Description


url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Occupation%20Data.txt"

occupations <- read_delim(url, show = FALSE)

occupations %>% 
  head(3) %>%          # Show first 3
  pander()
O*NET-SOC Code Title Description
11-1011.00 Chief Executives Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers.
11-1011.03 Chief Sustainability Officers Communicate and coordinate with management, shareholders, customers, and employees to address sustainability issues. Enact or oversee a corporate sustainability strategy.
11-1021.00 General and Operations Managers Plan, direct, or coordinate the operations of public or private sector organizations, overseeing multiple departments or locations. Duties and responsibilities include formulating policies, managing daily operations, and planning the use of materials and human resources, but are too diverse and general in nature to be classified in any one functional area of management or administration, such as personnel, purchasing, or administrative services. Usually manage through subordinate supervisors. Excludes First-Line Supervisors.


Occupation Technologies

Now, let’s take a look at a more sophisticated table, Technology Skills, which contains:

  1. Occupation SOC Code
  2. Technology Title
  3. Technology Example
  4. UNSPSC Commodity Code
  5. UNSPSC Commodity Title
  6. “Hot” Technology


Here, the first 6 observations all have the same SOC Code. That’s because “Technology Skills” lists all technologies associated with the occupation. Our preview shows the first 6 technologies important for occupation SOC code 11-1011.00, or that of “Chief Executives”.


url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Technology%20Skills.txt"

technologies <- read_delim(url, show = FALSE)

technologies %>% 
  head(6) %>%          # Show first 6
  pander()
Table continues below
O*NET-SOC Code Example Commodity Code
11-1011.00 Adobe Systems Adobe Acrobat 43232202
11-1011.00 AdSense Tracker 43232306
11-1011.00 Atlassian JIRA 43232201
11-1011.00 Blackbaud The Raiser’s Edge 43232303
11-1011.00 ComputerEase construction accounting software 43231601
11-1011.00 Database reporting software 43232305
Commodity Title Hot Technology
Document management software Y
Data base user interface and query software N
Content workflow software Y
Customer relationship management CRM software N
Accounting software N
Data base reporting software N


Joining Tables

Recall that SOC Codes are unique identifiers for each occupation and are most often used fully, or in part, as merge keys.


We can merge these using dplyr function left_join(), specifying argument by =:

occupations %>% 
  left_join(technologies, 
            by = `O*NET-SOC Code`)


Alternatively, we can use base R function merge():

merge(x = occupations, 
      y = technologies, 
      by = "O*NET-SOC Code")


In fact, if we simply use left_join() without identifying a merge column, it automatically detects merge keys with common names and notifies us in the console.

occupations %>% 
  left_join(technologies) %>% 
  select(Title, `Commodity Title`) %>% 
  head(100)
## Joining, by = "O*NET-SOC Code"


Insights with Joins

Now that we’re capable of joining O*NET database tables, we can begin to glean our first insights. For example:

  1. How many technologies are associated with each occupation?
  2. Which occupations require the most diverse array of technologies?


We can capture this quickly because our occupations-to-technologies merge is one-to-many, meaning that one record in one table corresponds to multiple records in another, adjoining table. Hence, if we count the number of new rows created by merging occupation titles and their many technologies, a simple row count will sum up to total unique technologies by occupation.


occupations %>% 
  left_join(technologies) %>% 
  group_by(Title) %>% 
  summarize(Technologies = n()) %>% 
  arrange(-Technologies) %>% 
  head(10)
## Joining, by = "O*NET-SOC Code"


Assignment

Follow the instructions below. Click the following icon to download the Lab 06 template for your submission.



The following objects have been created for you from the O*NET 26.3 Database.

  1. Abilities measures innate or natural abilities of occupational incumbents
  2. AlternateTitles lists over 53,000 alternate names for O*NET occupations
  3. Experience measures various kinds of experience, albeit encoded
  4. ExperienceCategories decodes measured experience
  5. Interests measures occupations with the Holland Code (RIASEC) Test
  6. JobZoneReference decodes classes of career development
  7. JobZones broad classes of career development for each occupation; encoded
  8. Knowledge defines areas of expertise and importance
  9. OccupationData lists the occupation SOC, title, and description
  10. Skills measures learned abilities of occupational incumbents
  11. TechnologySkills lists all relevant technology skills
  12. WorkStyles measures various performance indicators
  13. WorkValues measures important occupation qualities




Part 1: Basic Join Challenges

The following challenges require no more then 2-3 tables and use the same column names across all tables. Use basic dplyr verbs to answer specific questions.


(1) Question: Which 10 occupation titles (OccupationData$Title) have the highest number of “hot technologies” (TechnologySkills$HotTechnology)?

# Code


(2) Question: Which 10 occupation titles (OccupationData$Title), with more than 15 unique technologies, have the highest proportion (%) of “hot technologies” (TechnologySkills$HotTechnology)?

# Code


(3) Question: Which unique occupation titles (OccupationData$Title) have “R” listed as a technology example (TechnologySkills$Example)?

# Code


(4) Question: Which alternate occupation titles (AlternateTitles) have “R” listed as a technology example (TechnologySkills$Example)?

# Code


(5) Question: Which top 25 occupation titles (OccupationData$Title) have the highest level of “Independence” (Element Name) according to WorkValues?

# Code


(6) Question: Which top 25 occupation titles (OccupationData$Title) have the highest level of “Persistence” (Element Name) according to WorkStyles?

# Code


(7) Question: According to Holland Codes, “Realistic” describes people who like to work with things, requiring “motor coordination, skill, and strength”. Which top 20 occupation titles (OccupationData$Title) have the highest rating in “Realistic” scores (Element Name) according to Interests?

# Code


(8) Question: Again, per Holland Codes, “Investigative” describes people who like to work with ideas, preferring observation over action and facts over feelings. Which top 20 occupation titles (OccupationData$Title) have the highest rating in “Investigative” scores (Element Name) according to Interests?

# Code


(9) Question: Job Zones loosely classify how advanced one typically must be as an occupational incumbent. How many unique occupation titles (OccupationData$Title) are classified in each Job Zone (JobZones)?

# Code


(10) Question: Of occupation titles (OccupationData$Title) that list “R” as the Example in TechnologySkills, what proportion is in each Job Zone (JobZones)?

# Code




Part 2: Joins & Salary Predictions

The following challenges require 3-4 merged tables and incorporate data from the U.S. Bureau of Labor Statistics’ 2021 Occupational Employment and Wage Statistics which, like other federal agencies, uses unique SOC codes for each occupation.


The following commands will read in BLS data and rename and reformat the columns to match O*NET.

url <- paste0("https://raw.githubusercontent.com/DS4PS/ays-r-cod",
              "ing-sum-2022/main/labs/bls_occupation_salaries.csv")

Salaries <- read_csv(url, 
                     col_select = c("O*NET-SOC Code" = "OCC_CODE", 
                                    "Title" = "OCC_TITLE", 
                                    "Median" = "A_MEDIAN")) %>% 
  mutate(`O*NET-SOC Code` = gsub(x = `O*NET-SOC Code`, 
                                 pattern = "$", 
                                 replacement = "\\.00"),
         Median = gsub(x = Median, 
                       pattern = ",", 
                       replacement = ""),
         Median = as.numeric(Median))


Note: There is one key distinction between ONET and BLS SOC codes. The 2019 O*NET Taxonomy features even more detailed profiles than typical federal agencies. Some more detailed ONET occupation profiles will not successfully merge with BLS occupations.


(1) Question: Join the O*NET OccupationData and BLS Salaries tables. Then, filter out any occupations with an NA value in variable Median. Store this new table as object Common. How many occupations have salary data available from BLS?

# Code

Use basic dplyr verbs to answer each question.


(2) Question: Join new object Common with object Interests. Note that each interest is in Element Name and each interest rating (on a 7-point scale) is listed in Data Value. In effect, we can now associate each “interest” (e.g.  artistic, social, enterprising) with median annual salaries.

Use the lm() function with formula Median ~ Element Name to print the coefficients of a simple linear model. Which two element names (not “High Points”) are most associated with higher salaries?

# Code


(3) Question: Now join Common with TechnologySkills. Filter occupations to only include the following technologies in Example, which have been provided for you in object tech:

  • R
  • C
  • Python
  • JavaScript
  • Microsoft Excel
  • Spreadsheet software
  • SAS statistical software
  • Hypertext markup language HTML

Group by technologies (Example) and determine the following summaries:

  • Average value for Median (annual salary)
  • Count for total occupations with said tech
tech <- c("R",
          "C++",
          "Python",
          "JavaScript",
          "Microsoft Excel",
          "Spreadsheet software",
          "SAS statistical software",
          "Hypertext markup language HTML")

# Code


(4) Question: Please rate your own work values by providing a score from 1 to 7 for the following value categories.

  • Achievement
  • Working Conditions
  • Recognition
  • Relationships
  • Support
  • Independence

A “7” indicates highest importance. A “1” indicates lowest importance. A default of 3.5 has been provided as an example.

my_values <- tibble::tibble("Achievement"        = 3.5, 
                            "Working Conditions" = 3.5,
                            "Recognition"        = 3.5,
                            "Relationships"      = 3.5,
                            "Support"            = 3.5,
                            "Independence"       = 3.5)


(5) Question: Join objects Common and WorkValues and filter the following values from variable Element Name:

  • “First Work Value High-Point”
  • “Second Work Value High-Point”
  • “Third Work Value High-Point”

Then, use select() on the following variables:

  • “O*NET-SOC Code”
  • “Element Name”
  • “Data Value”
  • “Median”

Pipe (%>%) your output into the following tidyr function:

# Code

tidyr::pivot_wider(names_from = "Element Name", 
                   values_from = "Data Value")


Lastly, use function lm() and predict() to estimate your salary based on your work values!

Make sure to set eval = TRUE for the following to run:

my_model <- lm(formula = Median ~ 
                 Achievement + 
                 `Working Conditions` + 
                 Recognition + 
                 Relationships + 
                 Support + 
                 Independence, 
               data = you_data)

predict(object = my_model, newdata = my_values)