In this hypothetical scenario, pretend that you are a student currently learning the R language. Take your time and really get into the role.
We’re going to use public data from the U.S. Department of Labor in order to lightly research various R-related occupations, the values and interests of R-using occupational incumbents, the labor market value of the R language, and we conclude by predicting your future salaries based on your work values!
O*NET, or the Occupational
Information Network, is the admirable pursuit by the Department of
Labor to consolidate the classification of approximately one thousand
kinds of jobs. If that seems like a lot, it comes from the even more
voluminous DOT (Dictionary
of Occupational Titles), which defined over 13,000 jobs since 1938.
To do this, all federal agencies use a unique identifier for each job
type, known as a SOC or Standard
Occcupational Classification. SOC codes, typically variable
O*NET-SOC Code
, most often become the unique keys necessary
to joining our tables!
The O*NET Resource Center contains all individual tables used to power My Next Move and other platforms. Tables are organized according to myriad dimensions, seen here:
Let’s look at a few tables to get to know the data.
Read in one of the highest-level tables, Occupation Data, which contains:
<- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Occupation%20Data.txt"
url
<- read_delim(url, show = FALSE)
occupations
%>%
occupations head(3) %>% # Show first 3
pander()
O*NET-SOC Code | Title | Description |
---|---|---|
11-1011.00 | Chief Executives | Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers. |
11-1011.03 | Chief Sustainability Officers | Communicate and coordinate with management, shareholders, customers, and employees to address sustainability issues. Enact or oversee a corporate sustainability strategy. |
11-1021.00 | General and Operations Managers | Plan, direct, or coordinate the operations of public or private sector organizations, overseeing multiple departments or locations. Duties and responsibilities include formulating policies, managing daily operations, and planning the use of materials and human resources, but are too diverse and general in nature to be classified in any one functional area of management or administration, such as personnel, purchasing, or administrative services. Usually manage through subordinate supervisors. Excludes First-Line Supervisors. |
Now, let’s take a look at a more sophisticated table, Technology Skills, which contains:
Here, the first 6 observations all have the same SOC Code. That’s
because “Technology Skills” lists all technologies associated with the
occupation. Our preview shows the first 6 technologies important for
occupation SOC code 11-1011.00
, or that of “Chief
Executives”.
<- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Technology%20Skills.txt"
url
<- read_delim(url, show = FALSE)
technologies
%>%
technologies head(6) %>% # Show first 6
pander()
O*NET-SOC Code | Example | Commodity Code |
---|---|---|
11-1011.00 | Adobe Systems Adobe Acrobat | 43232202 |
11-1011.00 | AdSense Tracker | 43232306 |
11-1011.00 | Atlassian JIRA | 43232201 |
11-1011.00 | Blackbaud The Raiser’s Edge | 43232303 |
11-1011.00 | ComputerEase construction accounting software | 43231601 |
11-1011.00 | Database reporting software | 43232305 |
Commodity Title | Hot Technology |
---|---|
Document management software | Y |
Data base user interface and query software | N |
Content workflow software | Y |
Customer relationship management CRM software | N |
Accounting software | N |
Data base reporting software | N |
Recall that SOC Codes are unique identifiers for each occupation and are most often used fully, or in part, as merge keys.
We can merge these using dplyr
function
left_join()
, specifying argument by =:
%>%
occupations left_join(technologies,
by = `O*NET-SOC Code`)
Alternatively, we can use base R function merge()
:
merge(x = occupations,
y = technologies,
by = "O*NET-SOC Code")
In fact, if we simply use left_join()
without
identifying a merge column, it automatically detects merge keys with
common names and notifies us in the console.
%>%
occupations left_join(technologies) %>%
select(Title, `Commodity Title`) %>%
head(100)
## Joining, by = "O*NET-SOC Code"
Now that we’re capable of joining O*NET database tables, we can begin to glean our first insights. For example:
We can capture this quickly because our occupations-to-technologies merge is one-to-many, meaning that one record in one table corresponds to multiple records in another, adjoining table. Hence, if we count the number of new rows created by merging occupation titles and their many technologies, a simple row count will sum up to total unique technologies by occupation.
%>%
occupations left_join(technologies) %>%
group_by(Title) %>%
summarize(Technologies = n()) %>%
arrange(-Technologies) %>%
head(10)
## Joining, by = "O*NET-SOC Code"
Follow the instructions below. Click the following icon to download the Lab 06 template for your submission.
The following objects have been created for you from the O*NET 26.3 Database.
Abilities
measures innate or natural abilities of
occupational incumbentsAlternateTitles
lists over 53,000 alternate names for
O*NET occupationsExperience
measures various kinds of experience, albeit
encodedExperienceCategories
decodes measured experienceInterests
measures occupations with the Holland Code
(RIASEC) TestJobZoneReference
decodes classes of career
developmentJobZones
broad classes of career development for each
occupation; encodedKnowledge
defines areas of expertise and
importanceOccupationData
lists the occupation SOC, title, and
descriptionSkills
measures learned abilities of occupational
incumbentsTechnologySkills
lists all relevant technology
skillsWorkStyles
measures various performance indicatorsWorkValues
measures important occupation qualities
The following challenges require no more then 2-3 tables and use the
same column names across all tables. Use basic dplyr
verbs
to answer specific questions.
(1) Question: Which 10 occupation titles
(OccupationData$Title
) have the highest number of “hot
technologies” (TechnologySkills$HotTechnology
)?
# Code
(2) Question: Which 10 occupation titles
(OccupationData$Title
), with more than 15 unique
technologies, have the highest proportion (%) of “hot technologies”
(TechnologySkills$HotTechnology
)?
# Code
(3) Question: Which unique occupation titles
(OccupationData$Title
) have “R” listed as a technology
example (TechnologySkills$Example
)?
# Code
(4) Question: Which alternate occupation titles
(AlternateTitles
) have “R” listed as a technology example
(TechnologySkills$Example
)?
# Code
(5) Question: Which top 25 occupation titles
(OccupationData$Title
) have the highest level of
“Independence” (Element Name
) according to
WorkValues
?
# Code
(6) Question: Which top 25 occupation titles
(OccupationData$Title
) have the highest level of
“Persistence” (Element Name
) according to
WorkStyles
?
# Code
(7) Question: According to Holland Codes,
“Realistic” describes people who like to work with things, requiring
“motor coordination, skill, and strength”. Which top 20 occupation
titles (OccupationData$Title
) have the highest rating in
“Realistic” scores (Element Name
) according to
Interests
?
# Code
(8) Question: Again, per Holland Codes,
“Investigative” describes people who like to work with ideas, preferring
observation over action and facts over feelings. Which top 20 occupation
titles (OccupationData$Title
) have the highest rating in
“Investigative” scores (Element Name
) according to
Interests
?
# Code
(9) Question: Job Zones loosely classify how
advanced one typically must be as an occupational incumbent. How many
unique occupation titles (OccupationData$Title
) are
classified in each Job Zone
(JobZones
)?
# Code
(10) Question: Of occupation titles
(OccupationData$Title
) that list “R” as the
Example
in TechnologySkills
, what proportion
is in each Job Zone (JobZones
)?
# Code
The following challenges require 3-4 merged tables and incorporate data from the U.S. Bureau of Labor Statistics’ 2021 Occupational Employment and Wage Statistics which, like other federal agencies, uses unique SOC codes for each occupation.
The following commands will read in BLS data and rename and reformat the columns to match O*NET.
<- paste0("https://raw.githubusercontent.com/DS4PS/ays-r-cod",
url "ing-sum-2022/main/labs/bls_occupation_salaries.csv")
<- read_csv(url,
Salaries col_select = c("O*NET-SOC Code" = "OCC_CODE",
"Title" = "OCC_TITLE",
"Median" = "A_MEDIAN")) %>%
mutate(`O*NET-SOC Code` = gsub(x = `O*NET-SOC Code`,
pattern = "$",
replacement = "\\.00"),
Median = gsub(x = Median,
pattern = ",",
replacement = ""),
Median = as.numeric(Median))
Note: There is one key distinction between ONET and BLS SOC codes. The 2019 O*NET Taxonomy features even more detailed profiles than typical federal agencies. Some more detailed ONET occupation profiles will not successfully merge with BLS occupations.
(1) Question: Join the O*NET
OccupationData
and BLS Salaries
tables. Then,
filter out any occupations with an NA
value in variable
Median
. Store this new table as object Common
.
How many occupations have salary data available from BLS?
# Code
Use basic dplyr
verbs to answer each question.
(2) Question: Join new object Common
with object Interests
. Note that each interest is in
Element Name
and each interest rating (on a 7-point scale)
is listed in Data Value
. In effect, we can now associate
each “interest” (e.g. artistic, social, enterprising) with median
annual salaries.
Use the lm()
function with formula
Median ~ Element Name
to print the coefficients of a simple
linear model. Which two element names (not “High Points”) are most
associated with higher salaries?
# Code
(3) Question: Now join Common
with
TechnologySkills
. Filter occupations to only include the
following technologies in Example
, which have been provided
for you in object tech
:
Group by technologies (Example
) and determine the
following summaries:
Median
(annual salary)tech
<- c("R",
tech "C++",
"Python",
"JavaScript",
"Microsoft Excel",
"Spreadsheet software",
"SAS statistical software",
"Hypertext markup language HTML")
# Code
(4) Question: Please rate your own work values by
providing a score from 1
to 7
for the
following value categories.
A “7” indicates highest importance. A “1” indicates lowest
importance. A default of 3.5
has been provided as an
example.
<- tibble::tibble("Achievement" = 3.5,
my_values "Working Conditions" = 3.5,
"Recognition" = 3.5,
"Relationships" = 3.5,
"Support" = 3.5,
"Independence" = 3.5)
(5) Question: Join objects Common
and
WorkValues
and filter the following values from variable
Element Name
:
Then, use select()
on the following variables:
Pipe (%>%
) your output into the following
tidyr
function:
# Code
::pivot_wider(names_from = "Element Name",
tidyrvalues_from = "Data Value")
Lastly, use function lm()
and predict()
to
estimate your salary based on your work values!
Make sure to set eval = TRUE
for the following to
run:
<- lm(formula = Median ~
my_model +
Achievement `Working Conditions` +
+
Recognition +
Relationships +
Support
Independence, data = you_data)
predict(object = my_model, newdata = my_values)