My visual CV

career

data

viz

Author

Published

November 25, 2020

TLDR; I made a visual timeline of my career (CV) completely in R. Here I will show how it was done, and in the process how to hack ggplot objects, add images to plots, and use pretty much any font in your ggplots.

This is the final result, and if you are interested on how it was generated, read on.

Motivation

Recently I gave a presentation with an overview of my career, focusing on data analysis / bioinformatics. As my path to data analysis has not been straightforward, I wanted walk the audience through my career as we went along the several stages. Because I am someone who generally takes pride on giving good presentations, and this was about me after all, and I wanted it to be:

Clean and with an elegant design
Rich in features
Playful
With a wink a nod to data analysis and visualization

Basically me in a plot.

On top of that, I wanted to showcase my coding skills, so whatever I presented had to be done 100% with code. No post-production in Inkscape. In keeping with this mantra, the slide-deck itself was done with LaTeX.

Because the plot is so intrinsically linked with the presentation itself, as I explain the how the plot was done, I will include my though process on certain choices and how that is linked with the presentation¹.

Framework

Important for me, due to the way my career as paned out so far, and the goal of the presentation was to ²:

depict my journey from wet lab to data analysis, with increasing amounts of time spent analyzing data.
include a rough timeline should be present
highlight milestones such as academic publications and degrees
demonstrate how international the path has been

Initially I thought of using a Gantt chart as those are the go to visual representations for timelines and there are several packages available ³. However, after prototyping with pen and paper, I decided it would be too hard to incorporate all the features I wanted to have without cluttering the plot. Basically, it didn’t feel right.

Luckily inspiration was at hand in one of my favorite blogs, by an excellent data scientist and communicator, which just happened to have a similar path to data science as my own. Her visual CV is hand drawn, and looks great, but I made it a challenge to re-create that playfulness, along with the richness of information, in code (also I am terrible at drawing/painting).

Set-up

First step is to load the libraries I’ll be using:

tidyverse, for general data wrangling
lubridate, fancy functions to work with dates
data.table, rolling joins and general awesomeness
ggplot2, plots et al.
ggthemes, I don’t think this was actually used.
scales, functions to add some spice to ggplot axis
cowplot, add images to plots
extrafont, use of system fonts

Code

library("tidyverse")
library("lubridate")
library("data.table")
library("ggplot2")
library("ggthemes")
library("cowplot")
library("extrafont")
extrafont::loadfonts(quiet = TRUE)
library("bib2df")
library("rcrossref")

My CV in a table

The data itself is structured in a straightforward way with variables for my roles, dates, location, and a mock data analysis ratio (how much data analysis was part of my role). Because this is supposed to a cartoonish overview of my CV, not a rigorous representation, the data analysis ratio is close to reality but I won’t swear it is strictly correct.

Code

antonio <- fread("/home/antonio/Documents/personal/CV/visual_cv/timeline_cv.tsv") %>%
    melt(measure.vars = c("Start", "End"), variable.name = "Timpepoint",
        value.name = "Date") %>%
    .[, Date := as.IDate(Date)]

antonio %>%
    head() %>%
    knitr::kable()

Role	Place	Type	Country	da_ratio	Year	Timpepoint	Date
BSc	UA	Student	Portugal	0.2	2001	Start	1997-09-01
MSc	UA	Student	Portugal	0.3	2005	Start	2002-09-01
Technician	CNC	Service	Portugal	0.1	2006	Start	2005-05-01
PhD	University of Leicester	Student	UK	0.4	2009	Start	2006-02-01
Postdoc	MPI-CBG	Researcher	Germany	0.6	2013	Start	2009-10-01
Outreach	Internet	Package	World	1.0	2020	Start	2020-07-15

The Canvas

With the events of my CV in tabular format I could get started with plotting. Key features of the plot:

A theme with as few visual elements as possible, and attention to detail
The font carefully selected (luckily I have dozens of fonts installed in my system)
Addition of small pictograms

The first and second items were achieved with a custom theme and the package extrafont. extrafont allows us to use pretty much any system font in R plots. There a couple of steps that need to be done before system font can be used:

The first is extrafont::font_import() to detect and import system fonts. This needs to be done only once after installing the package, or any time a new font installed. This can take several minutes depending on how many system fonts your system has. Mine has a LOT.
The second step is to load the system fonts in the current R session to make them available, extrafont::loadfonts(quiet = TRUE)

Code

extrafont::font_import()
extrafont::loadfonts(quiet = TRUE)

fonttable() %>%
    filter(stringr::str_detect(FullName, "Annie"))

You can also use the third snippet of code make sure the desired font is indeed available in R.

The theme is a variation of theme_light to which I removed as many elements as I thought necessary give it that clean aspect. Mind you this is not for data visualization but rather to be aesthetically pleasing (ideally one should be able to combine both). The is also at this stage that the custom font was defined. By creating a theme in this manner we don’t need to add these lines to every plot.

Code

theme_set(theme_light())
theme_update(text = element_text(family = "Annie Use Your Telescope",
    size = 12), legend.title = element_text(size = 6), legend.text = element_text(size = 6),
    axis.title.y = element_blank(), axis.title.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank(),
    axis.line.x = element_line(size = 1, linetype = "solid",
        arrow = arrow(length = unit(0.1, "inches"), type = "closed")),
    panel.grid = element_line(colour = "black", linetype = "dashed",
        size = 0.5), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(), panel.grid.major.y = element_line(colour = "gray",
        linetype = "dotted", size = 0.5), panel.border = element_blank())

With the theme set I could produce the first, and very basic, plot:

Code

p <- ggplot(antonio, aes(x = Date, y = da_ratio)) + geom_smooth(method = "loess",
    color = "black", size = 1.5, span = 1, se = FALSE) + xlim(min(antonio$Date),
    as.IDate("2020-12-01")) + ylim(0, 1) + geom_hline(yintercept = 0.5,
    linetype = "dashed", color = "#c72e29", size = 1) + annotate(geom = "text",
    x = min(antonio$Date) %m+% months(6), y = 0.35, label = "Wet lab",
    color = "red", size = 5, family = "Annie Use Your Telescope") +
    annotate(geom = "text", x = as.IDate("2012-12-01"), y = 0.9,
        label = "Data analysis", color = "red", size = 5, family = "Annie Use Your Telescope")

p

The data science trajectory line is a smooth line, result of a linear model fitted to the time points in the CV. The model was chosen based on the very accurate statistical measure of “being easy on the eye” - smooth and in an upward trajectory after a particular time point.

To add the small icons, I ended up using magick (via cowplot) which I could only install with conda (well, mamba), due to missing dependencies. I did try the R way with install.packages("magick") but gave up after the second error. Life is too short.

The icons themselves were downloaded from the Noun Project which has tons of free to use icons (with attribution). I ended up paying for these because I liked them so much, but nevertheless I will credit the author, Tippawan Sookruay.

As these images will be included in several versions of the plot, I wrapped the code in a function to save typing:

Code

add_img <- function(gg) {
    ggdraw(gg) + draw_image("/home/antonio/Documents/personal/CV/visual_cv/figure/noun_pipette_3384602.svg.png",
        scale = 0.1, x = 0.09, y = 0.001, hjust = 0.52, vjust = 0.25) +
        draw_image("/home/antonio/Documents/personal/CV/visual_cv/figure/noun_chart_3384933.svg.png",
            scale = 0.1, x = 0.4, y = 0.25)
}

p_save <- add_img(p)

ggsave("timeline_icons.png", p_save, dpi = 600, width = 10, height = 6,
    units = "cm")

So now the plots has the basics:

A timeline on the x-axis
rough % of data analysis as part of my job
and a couple of cool looking icons to show how it all started and it is going

In technical terms, the most difficult part was to add the icons, and in particular placing them in the right location. Not only did it take a lot of trial an error, it was only just right once saved as a file (pdf). I wanted to have a wide plot to fit with the slides, and the plotting window is square, so some eye-balling was required. I tried a few other packages (patchwork for example), but in the end magick + cowplot was the best option for my purposes.

But we are not done yet: career milestones and cities I lived in are still missing.

Publications

Until now I have worked in fairly academic milieu where one’s achievements are measured mostly by publications. whilst I don’t particularly agree with this⁴, it is a decent proxy for output which is what I was looking to represent. Other fields will have their own metrics of productivity (number of sales, features shipped, etc).

For full reproducibility, I tried to get my publications list from pubmed or ORCID but ran into problems. The issue with pubmed queries is that my name follows Portuguese naming conventions, and each publisher seems to ignore what I write in the provided forms and makes up a different version of my name. Not helping matters is that my first and last names are fairly common, so there were a lot of false positives in the pubmed queries. I could retrieve some of the variables I needed from ORCID but publications dates were missing for some paper.

Luckily I keep all my publications in a bibtex file exported from zotero and ended up resorting to that. The publication data was extracted with the help of bib2df to read in the bibtex file, and rcrossref to retrieve publications dates which were incomplete or missing in the bibtex file.

Code

# ==========================================================================
# Get publication list
# ==========================================================================
path <- "/home/antonio/Documents/personal/CV/Awesome-CV/examples/resume/AD_papers.bib"

df <- bib2df(path)
setDT(df)
my_dois <- df[!is.na(DOI)]$DOI

# my_dois <-
# rorcid::identifiers(rorcid::works('0000-0002-1803-1863'))
pubs <- rcrossref::cr_works(dois = my_dois)$data %>%
    setDT() %>%
    .[, c("title", "type", "container.title", "published.print",
        "published.online", "deposited")] %>%
    .[, Date := ifelse(!is.na(published.print), published.print,
        published.online)] %>%
    .[, Date := ifelse(!is.na(Date), Date, deposited)] %>%
    .[, Date := ifelse(nchar(Date) == 4, paste(Date, "03-01",
        sep = "-"), Date)] %>%
    .[, Date := ifelse(nchar(Date) == 7, paste(Date, "01",
        sep = "-"), Date)] %>%
    .[, Date := as.IDate(Date)]

pubs %>%
    head() %>%
    knitr::kable()

title	type	container.title	published.print	published.online	deposited	Date
Innate Immune Training of Granulopoiesis Promotes Anti-tumor Activity	journal-article	Cell	2020-10	NA	2022-11-25	2020-10-01
The RNA binding protein human antigen R is a gatekeeper of liver homeostasis	journal-article	Hepatology	2022-04	2021-12-05	2023-06-07	2022-04-01
Exosomal miRNAs from Prostate Cancer Impair Osteoblast Function in Mice	journal-article	International Journal of Molecular Sciences	NA	2022-01-24	2022-01-26	2022-01-24
Membrane-associated cytoplasmic granules carrying the Argonaute protein WAGO-3 enable paternal epigenetic inheritance in Caenorhabditis elegans	journal-article	Nature Cell Biology	2022-02	2022-02-07	2022-10-23	2022-02-01
Bardet-Biedl syndrome proteins modulate the release of bioactive extracellular vesicles	journal-article	Nature Communications	NA	2021-09-27	2022-12-02	2021-09-27
Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production	journal-article	Cell	2020-05	NA	2022-06-25	2020-05-01

Once that was done, I could proceed to add the publications has milestones in the plot - one point per paper.

Not so fast.

Turns out that when that smooth line was created, a linear model fit on the original time points, I lost the reference points, ratio of data analysis, on the y-axis. So now I also need to estimate the Y positions in the plot.

ggplot hacking

I have done enough custom plots, and spent enough hours googling how to do it, to know that ggplot objects store a ton of information which can be accessed and modified (at your own risk). So what I did was to extract the x and y coordinates of the fitted line from the ggplot object.

This was done in several steps:

re-run the linear model (spline)
plot the fitted line
extract the coordinates from the plot (ggplot_build(p_smooth)$data[[1]])

Code

# Retrieve the smooth line (dummy timeline) to add points
# of interest
# --------------------------------------------------------------------------
spline_int <- as.data.frame(spline(antonio$Date, antonio$da_ratio))

p_smooth <- ggplot(antonio, aes(x = Date, y = da_ratio)) + geom_smooth(method = "loess",
    color = "black", size = 1.5, span = 1, se = FALSE) + xlim(min(antonio$Date),
    as.IDate("2020-12-01")) + ylim(0, 1)

smooth_df <- ggplot_build(p_smooth)$data[[1]] %>%
    setDT() %>%
    .[, Date := as.IDate(x)] %>%
    .[]

smooth_df %>%
    head() %>%
    knitr::kable()

x	y	flipped_aes	PANEL	group	colour	fill	linewidth	linetype	weight	alpha	Date
10105.00	0.2179275	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1997-09-01
10211.34	0.2120627	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1997-12-16
10317.68	0.2067649	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1998-04-01
10424.03	0.2020354	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1998-07-17
10530.37	0.1978752	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1998-10-31
10636.71	0.1942856	FALSE	1	-1	black	grey60	1.5	1	1	0.4	1999-02-14

Now that I have the coordinates, these need to be matched with the publication dates. However, the x-axis values (dates) of the fitted plot are intervals which do not match the CV dates. To solve that problem, as nearly any data manipulation problem, data.table and rolling joins came to the rescue. The goal was to find the closest time point of my CV in the fitted plot.

Code

## find the closest date of publications in the dummy
## timeline
pubs_smooth <- smooth_df[pubs, roll = "nearest", on = "Date"] %>%
    .[, new_date := as.IDate(x)] %>%
    .[, Date := NULL] %>%
    setnames("new_date", "Date") %>%
    .[, c("Date", "y", "title")] %>%
    .[Date <= as.IDate("2012-01-01"), Publications := "Experimental"] %>%
    .[Date >= as.IDate("2012-01-01"), Publications := "Analyst"] %>%
    .[, Publications := ifelse(str_detect(title, "NXF1"),
        "Both", Publications)] %>%
    .[]

pubs_smooth %>%
    head() %>%
    knitr::kable()

Date	y	title	Publications
2020-09-01	0.9472880	Innate Immune Training of Granulopoiesis Promotes Anti-tumor Activity	Analyst
2020-09-01	0.9472880	The RNA binding protein human antigen R is a gatekeeper of liver homeostasis	Analyst
2020-09-01	0.9472880	Exosomal miRNAs from Prostate Cancer Impair Osteoblast Function in Mice	Analyst
2020-09-01	0.9472880	Membrane-associated cytoplasmic granules carrying the Argonaute protein WAGO-3 enable paternal epigenetic inheritance in Caenorhabditis elegans	Analyst
2020-09-01	0.9472880	Bardet-Biedl syndrome proteins modulate the release of bioactive extracellular vesicles	Analyst
2020-05-17	0.9417646	Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production	Analyst

Some of that information is not necessary but I kept it anyway to show how it looks.

A similar join was done for some key outreach dates that will also be highlighted in the final plot.

Code

## same for outreach activities
outreach <- antonio[Role == "Outreach" & Timpepoint == "Start"]

outreach_smooth <- smooth_df[outreach, roll = "nearest", on = "Date"] %>%
    .[, new_date := as.IDate(x)] %>%
    .[, Date := NULL] %>%
    setnames("new_date", "Date") %>%
    .[, c("Date", "y", "Type")] %>%
    .[]

Annotate with academic degrees

I could have used points (geom_point) with different color/shape to highlight my academic degrees, but these belong to a different category of events than publications or other more technical milestones / achievements, and would add too much information to the lines anyway. As an alternative I decided to use arrows and text to indicate when those degrees where obtained.

Again there is nothing magic or novel about this, in fact I took inspiration from Cedric Scherer, except the need for trial and error to get the locations correct. The positions of the arrows were put on a tibble to make it easier to plot later on:

Code

# plot annotations - academic points of interest
# --------------------------------------------------------------------------
arrows <- tibble(x1 = c(as.IDate("2000-09-01"), as.IDate("2004-12-01"),
    as.IDate("2010-09-01")), x2 = c(as.IDate("2001-09-01"), as.IDate("2005-05-01"),
    as.IDate("2009-08-01")), y1 = c(0.08, 0.15, 0.35), y2 = c(0.13,
    0.2, 0.4))
arrows

# A tibble: 3 × 4
  x1         x2            y1    y2
  <IDate>    <IDate>    <dbl> <dbl>
1 2000-09-01 2001-09-01  0.08  0.13
2 2004-12-01 2005-05-01  0.15  0.2 
3 2010-09-01 2009-08-01  0.35  0.4

Remember that in the plots show in the X-window, the arrows, text annotations, and icons where not in the correct locations. I kept saving the plots with the final dimensions to see if everything is in it’s right place. Saving as pdf gave me the best results vs saving as png. This is the same reason why I am saving the plots and linking to them in the Rmd rather than letting rmarkdown do all the work for me.

This is the final version:

Code

p_complete <- p + geom_jitter(data = pubs_smooth %>%
    arrange(Publications), aes(x = Date, y = y, fill = Publications),
    size = 2, alpha = 0.8, stroke = 1, shape = 21) + geom_point(data = outreach_smooth,
    aes(x = Date, y = y), fill = "#e37e00", size = 2, alpha = 0.7,
    stroke = 1, shape = 23) + scale_colour_stata() + scale_fill_stata() +
    guides(fill = guide_legend(reverse = TRUE)) + labs(fill = "Publications:") +
    annotate("text", x = as.IDate("1999-09-01"), y = 0.08, color = "gray20",
        lineheight = 0.9, label = "BSc", size = 4, family = "Annie Use Your Telescope") +
    annotate("text", x = as.IDate("2004-11-01"), y = 0.12, color = "gray20",
        lineheight = 0.9, label = "MSc", size = 4, family = "Annie Use Your Telescope") +
    annotate("text", x = as.IDate("2010-10-01"), y = 0.3, color = "gray20",
        lineheight = 0.9, label = "PhD", size = 4, family = "Annie Use Your Telescope") +
    geom_curve(data = arrows, aes(x = x1, y = y1, xend = x2,
        yend = y2), arrow = arrow(length = unit(0.07, "inch")),
        size = 0.4, color = "gray20", curvature = 0.3) + theme(legend.direction = "horizontal",
    legend.position = c(1, 0.05), legend.justification = c(1,
        0))

p_save <- add_img(p_complete)
ggsave("timeline_complete.png", p_save, dpi = 300, width = 10,
    height = 6, units = "cm")

What is the information contained in the figure:

icons depicting a start as a bench scientist and the current status as a data analyst
a line showing a smooth upward trajectory towards data science, but
data science only took over bench work (red line) at some point after my PhD
milestones (papers) are also depicted as round points, and in keeping with my trajectory there some to which I contributed only lab work, others with data analysis (one with both!).
the milestones with the diamond shape are to be described in the presentation (blog, R package maintainer).

I hope you can appreciate that this plot contains a lot of information, but doesn’t feel cluttered. Also, this is not meant to be a self-contained, self-explanatory piece of data visualization - I will be guiding the audience through the pieces of information and revealing them only when needed.

Getting there

Well, nearly done. There are still a couple of modifications to be done for the presentation:

No legend because I wanted to talk about the milestones during the presentation without giving way too early what the points meant;
Indication of the places of work is still missing.

The latter is crucial to the presentation because I will be highlighting each place of work, and follow that up with slides describing the my role, tasks, and accomplishments. So in effect highlighting the laces of work in a stepwise manner is critical to create breaks and setting the tone for each part of the presentation.

Removing the legend is fairly simple (once you know who to do it and I always google it).

Code

## no legend
p_no_leg <- p_complete + guides(fill = "none")
p_save <- add_img(p_no_leg)
ggsave("timeline_no_legend.png", p_save, dpi = 300, width = 10,
    height = 6, units = "cm")

To show how I moved around I made used of geom_annotation and shaded boxes to highlight each job, one at the time. Each one of these was then shown as bridge between parts of the talk.

For simplicity I am plotting only one here:

Code

## add countries

p_no_leg <- p_complete + guides(fill = "none")
add_img(p_no_leg)

Code

# ggsave('timeline_no_legend.pdf', dpi = 600, width = 10,
# height = 6, units = 'cm')

p_pt <- p_no_leg + annotate("rect", xmin = min(antonio$Date),
    xmax = as.Date("2006-02-01"), ymin = -Inf, ymax = Inf, fill = "#d9e6eb",
    alpha = 0.35)
p_save <- add_img(p_pt)

ggsave("timeline_no_legend.pt.png", p_save, dpi = 600, width = 10,
    height = 6, units = "cm")

Final words

And that was it. It took me quite some time to get it done, including testing packages which were not used, but in the end I am quite happy with the result. It was a lot of work, but I wanted to impress and I did.

Footnotes

Before any presentation I always think about who my audience will be and what is the goal. Only then do I start sketching the slides with pen and paper to organize the ideas into slides. One key message per slide.↩︎
For example https://github.com/laresbernardo/lares and https://github.com/daattali/timevis.↩︎
It’s fine to use the publication record, but someone career should not be just about that. There’s also teaching, mentoring, outreach, software, etc, which I am not sure it’s always taken into account. Not that it affects me because I am not looking to climb the academic ladder.↩︎
The slide-deck is not available, but if a few people show interest I could make put it online.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{domingues2020,
  author = {Domingues, António},
  title = {My Visual {CV}},
  date = {2020-11-25},
  url = {https://amjdomingues.com/posts/2020-11-25-visual-cv/},
  langid = {en}
}

For attribution, please cite this work as:

Domingues, António. 2020. “My Visual CV.” November 25, 2020. https://amjdomingues.com/posts/2020-11-25-visual-cv/.