
The datasets for week 1 are very interesting: Tweets with the #rstats and #TidyTuesday hashtags.

These datasets were collected with the rtweet package, a very practical package for retrieving Twitter data. I used a couple of other packages some years ago, but this one is very easy to use and includes many utility functions that can make your life easier.

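For example, to get some tweets with the #TidyTuesday hashtag you can run this (you need to set up your Twitter API credentials first):

library(rtweet)

# Search recent tweets containing the hashtag, excluding retweets
rt <- search_tweets("#TidyTuesday", include_rts = FALSE)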

For the purpose of this post, the dataset I’ve chosen to analyze is the one with the #TidyTuesday hashtag tweets. Let’s take a quick look at it:


library(tidyverse)  # read_rds(), %>%, select(), ggplot()

tidytuesday_tweets <- read_rds(url(""))

tidytuesday_tweets %>%
  glimpse()

The dataset has many columns, so I will keep only a few of them:

tidytuesday_tweets <- tidytuesday_tweets %>%
  select(screen_name, status_id, created_at, text, retweet_count, favorite_count)

## # A tibble: 1,565 x 6
##    screen_name   status_id           created_at         
##    <chr>         <chr>               <dttm>             
##  1 Eeysirhc      1075264883367632896 2018-12-19 05:41:40
##  2 Eeysirhc      1072728604440580096 2018-12-12 05:43:24
##  3 Eeysirhc      1074488199907500032 2018-12-17 02:15:24
##  4 clairemcwhite 1075197256062644224 2018-12-19 01:12:56
##  5 stevie_t13    1075186982203084801 2018-12-19 00:32:07
##  6 stevie_t13    1072518335001169920 2018-12-11 15:47:52
##  7 thomas_mock   1074664472634109952 2018-12-17 13:55:51
##  8 thomas_mock   1075028993345245185 2018-12-18 14:04:19
##  9 thomas_mock   1072163012327415808 2018-12-10 16:15:56
## 10 thomas_mock   1075173440481579010 2018-12-18 23:38:18
##    text                                                                   
##    <chr>                                                                  
##  1 "#tidytuesday getting back into the keywords game this week with a lit~
##  2 "#tidytuesday despite what they say NY hotdog stands have some of the ~
##  3 "I realize it's a Sunday night but I wasn't too happy with my last #ti~
##  4 "@thomas_mock Maybe a shorter one line subtitle, like \"Weekly data pr~
##  5 #BeckyG #TidyTuesday                           
##  6 #ArianaGrande #thankunext #TidyTuesday         
##  7 "The @R4DScommunity welcomes you to week 38 of #TidyTuesday!  We're ex~
##  8 "#r4ds community, please vote on a #TidyTuesday hex-sticker! \n\n<U+26A0><U+FE0F><U+26A0><U+FE0F> ~
##  9 "The @R4DScommunity welcomes you to week 37 of #TidyTuesday!  We're ex~
## 10 "@jschwabish @jsonbaik @awunderground Hey @jschwabish this is an aweso~
##    retweet_count favorite_count
##            <int>          <int>
##  1             0              0
##  2             1              7
##  3             1              5
##  4             0              2
##  5             0              0
##  6             0              0
##  7             7             14
##  8             7             11
##  9             7             23
## 10             1              6
## # ... with 1,555 more rows


In this post, we’ll explore the words inside the tweets using wordclouds. This way, we can get a general idea of the main topics covered in the 2018 TidyTuesday weeks.

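library(tidytext)

# token = "tweets" keeps hashtags and @mentions as single tokens.
# Support for it was removed in tidytext 0.4.0, so this call needs an older version.
tidytuesday_words <- tidytuesday_tweets %>%
  unnest_tokens(word, text, token = "tweets")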

## # A tibble: 42,485 x 6
##    screen_name status_id  created_at          retweet_count favorite_count
##    <chr>       <chr>      <dttm>                      <int>          <int>
##  1 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  2 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  3 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  4 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  5 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  6 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  7 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  8 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  9 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
## 10 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
## # ... with 42,475 more rows, and 1 more variable: word <chr>
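
To make the one-token-per-row idea concrete, here is a minimal base R sketch. It is not the tidytext tokenizer: `toy_tokenize()` and the sample tweet are made up for illustration, and the regular expression is only a rough approximation of how hashtags and mentions are kept whole.

```r
# Toy tokenizer: NOT the tidytext implementation, just an illustration.
# Lower-cases the text and keeps runs of letters, digits and underscores,
# optionally prefixed by '#' or '@' so hashtags and mentions stay whole.
toy_tokenize <- function(text) {
  lowered <- tolower(text)
  regmatches(lowered, gregexpr("[#@]?[a-z0-9_]+", lowered))[[1]]
}

# Hypothetical tweet text
tweet <- "Week 38 of #TidyTuesday with @R4DScommunity!"
words <- toy_tokenize(tweet)

# One row per token, with the tweet id repeated on every row
toy_words <- data.frame(status_id = "1", word = words, stringsAsFactors = FALSE)
print(toy_words)
```

Each token gets its own row while the tweet-level columns are repeated, which is why the first ten rows of the output above all belong to the same tweet.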

Let’s quickly check the word frequency:

tidytuesday_words %>%
  count(word, sort = TRUE)
## # A tibble: 7,966 x 2
##    word             n
##    <chr>        <int>
##  1 #tidytuesday  1520
##  2 the           1377
##  3 to            1151
##  4 a              834
##  5 and            792
##  6 of             734
##  7 for            691
##  8 #rstats        634
##  9 i              627
## 10 in             579
## # ... with 7,956 more rows
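
If you are curious about what `count(word, sort = TRUE)` computes, the following base R sketch does roughly the same thing on a handful of made-up tokens (the `words` vector is hypothetical, not taken from the dataset):

```r
# Hypothetical toy tokens, standing in for the `word` column
words <- c("the", "#tidytuesday", "plot", "the", "#tidytuesday", "the", "map")

# table() counts each distinct value; sort() puts the most frequent first
freq <- sort(table(words), decreasing = TRUE)
top <- data.frame(word = names(freq), n = as.integer(freq), stringsAsFactors = FALSE)
print(top)
```

As in the real output, common function words and the hashtag itself float to the top, which is the motivation for the cleaning step that follows.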

After looking at the words, we need to remove some that we know will skew the analysis. First, we remove the stop words. Then we remove tokens like “thomasmock” and “R4DScommunity”, which are the user names of the people who run TidyTuesday and therefore appear very often. We also remove “#tidytuesday” itself, since every tweet contains this hashtag (that is how the dataset was collected), along with a few other generic words such as “data” and “plot”. Finally, we drop URLs and tokens that contain no letters:

tidytuesday_words <- tidytuesday_words %>%
  anti_join(stop_words, by = "word") %>%
  filter(!(word %in% c("@thomasmock", "@r4dscommunity", "#tidytuesday", "tidytuesday", "#rstats", "#r4ds",
                     "data", "week", "code", "#tidyverse", "#dataviz", "time", "weeks",
                     "submission", "plot", "plots", "dataset"))) %>%
  filter(!grepl("^http", word), grepl("[a-zA-Z]", word))

## # A tibble: 14,652 x 6
##    screen_name status_id  created_at          retweet_count favorite_count
##    <chr>       <chr>      <dttm>                      <int>          <int>
##  1 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  2 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  3 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  4 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  5 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  6 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  7 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  8 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
##  9 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
## 10 Eeysirhc    107526488~ 2018-12-19 05:41:40             0              0
## # ... with 14,642 more rows, and 1 more variable: word <chr>
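
The four filtering conditions can be easier to follow on a tiny example. This is a base R sketch with made-up tokens; `toy_stop` is a three-word stand-in for the much larger `stop_words` table, and the URL is a placeholder.

```r
# Hypothetical tokens and a tiny stand-in stop-word list
# (the real `stop_words` table in tidytext is far larger)
tokens <- c("the", "#tidytuesday", "https://t.co/abc", "2018",
            "maps", "r", "38", "#rstats")
toy_stop <- c("the", "a", "of")
custom_drop <- c("#tidytuesday", "#rstats")

keep <- !(tokens %in% toy_stop) &   # same effect as the anti_join on `word`
  !(tokens %in% custom_drop) &      # hand-picked tokens to exclude
  !grepl("^http", tokens) &         # drop URLs
  grepl("[a-zA-Z]", tokens)         # drop tokens with no letters, e.g. "2018"

cleaned <- tokens[keep]
print(cleaned)
```

Only "maps" and "r" survive here: the stop word, the two hashtags, the URL and the purely numeric tokens are all filtered out.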

Now, let’s try our first wordcloud with all the words we have:

library(ggwordcloud)  # geom_text_wordcloud()
library(ggthemes)     # scale_colour_gradient2_tableau()

tidytuesday_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  ggplot(aes(label = word, size = n, colour = n)) +
    geom_text_wordcloud() +
    scale_size_area() +
    theme_minimal() +
    scale_colour_gradient2_tableau("Sunset-Sunrise Diverging")

Let’s give some shape to the wordcloud:

tidytuesday_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  ggplot(aes(label = word, size = n, colour = n)) +
    geom_text_wordcloud(rm_outside = TRUE,
                        mask = png::readPNG("twitter_logo_black-nobackground.png")) +
    scale_size_area() +
    theme_minimal() +
    scale_colour_gradient2_tableau("Sunset-Sunrise Diverging")

That’s much nicer! Many words were dropped so that the wordcloud fits neatly inside the Twitter logo. I think some of the dropped words are relevant for getting a sense of the topics covered in 2018, so I decided to keep both wordclouds: the big one and the Twitter-logo-shaped one (initially, I just wanted the Twitter-logo-shaped one).

And finally, to get a sense of the weekly topics, let’s try a wordcloud by week:

library(lubridate)  # floor_date()

tidytuesday_words %>%
  mutate(week = as.Date(floor_date(created_at, "week", week_start = 1))) %>%
  count(week, word, sort = TRUE) %>%
  group_by(week) %>%
  filter(row_number(desc(n)) <= 40) %>%
  mutate(sz = n / max(n)) %>%
  ungroup() %>%
  ggplot(aes(label = word, size = sz, colour = sz)) +
    geom_text_wordcloud(rm_outside = TRUE) +
    scale_radius(range = c(1, 5)) +  # alternative: scale_size_area(max_size = 5)
    theme_minimal() +
    scale_colour_gradient2_tableau("Sunset-Sunrise Diverging") +
    facet_wrap(~week, ncol = 4, scales = "free")

Nice! Now we can easily spot the main topics covered during 2018 TidyTuesdays.
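
The weekly grouping relies on two ideas: rounding each date down to the Monday of its week, and scaling each word count by the largest count in that week so every panel uses a comparable size range. Here is a minimal base R sketch of both, using made-up rows; `week_start_monday()` is a hypothetical helper written for this example and stands in for `floor_date(x, "week", week_start = 1)`. The top-40 filter is left out for brevity.

```r
# Hypothetical per-tweet tokens with their dates
toy <- data.frame(
  created = as.Date(c("2018-12-16", "2018-12-17", "2018-12-19",
                      "2018-12-19", "2018-12-19")),
  word = c("maps", "medium", "medium", "medium", "articles"),
  stringsAsFactors = FALSE
)

# "%u" is the ISO weekday number (Monday = 1, Sunday = 7), so subtracting
# (weekday - 1) days moves any date back to the Monday of its week
week_start_monday <- function(d) d - (as.integer(format(d, "%u")) - 1)
toy$week <- week_start_monday(toy$created)

# Count each word within its week, then scale by the weekly maximum
toy$n <- 1
counts <- aggregate(toy["n"], by = list(week = toy$week, word = toy$word), FUN = sum)
counts$sz <- counts$n / ave(counts$n, counts$week, FUN = max)
print(counts[order(counts$week, -counts$n), ])
```

Because `sz` is relative to each week's most frequent word, a quiet week and a busy week both have a top word with `sz = 1`.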


As you can see, you can get a good overview of the topics in a text by using just wordclouds. Some data cleaning is necessary, but the results are worthwhile.

All the code in this post can be found in my GitHub repo: TidyTuesdayCode