
Link Segmentation Analysis

This piece reflects work completed during my time at Zoopla, where I was tasked with a pressing business need: to better understand our backlink profile. Please note that all data featured is publicly available.

TOKYO, Japan — A busy, rainy day in Tokyo, with little room to navigate.
Harry Austen via Unsplash

by Harry Austen

Published Mar. 01, 2021 GMT

This analysis, completed for Zoopla, the UK property portal, studies the company's and its competitors' backlink profiles to understand the industry landscape and inform its next link-building campaign. As noted above, the data is publicly available and can be requested by anyone with the relevant tooling.
The historical data was requested over a year ago and is unlikely to be an accurate representation of where Zoopla currently sits in the market.

Zoopla's and its competitors' backlinks have been grouped into 12 distinct categories using text analysis (the same approach featured in the Keyword Attribution Model piece), to help remove the guesswork from future business and marketing efforts around link-building.

  • Property – property sites, estate agents and businesses that apply solely to property
  • Blog – personal, advice and review blogs
  • News & Current Affairs – publications that predominantly cover news and events of any kind (e.g. sports news, financial news)
  • Financial & Business Services – sites that discuss or offer financial and business services/management
  • Consumer Goods & Services – consumer-facing businesses (property businesses consolidated in Property)
  • Technology & Media – agencies or sites that offer technology-based tools or services, and multimedia organizations
  • Education – schools, universities, student and teacher blogs (e.g. .edu)
  • Gov – government or state-owned domains (e.g. .gov)
  • Directory – service and goods directories, resources for finding a business
  • Travel, Transport & Leisure – holiday services, travel guides, advice and map-based blogs
  • Energy & Environment – environmental organizations, energy providers and services
  • Void (not included below) – domains to be added to our disavow list (general spam) or excluded from future reporting (e.g. forum sites, estate agents that link to us in their footer)
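The grouping itself reuses the text-analysis approach from the Keyword Attribution Model piece. As a rough illustration only, here is a minimal R sketch of the idea; the referring_domain column, the example domains and the matching rules are assumptions for this example, not the actual rules used in the analysis.

# Hypothetical sketch: tag linking root domains with a category via pattern matching
library(data.table) # provides the %like% operator

lrd_dataset <- data.frame(
   referring_domain = c("homesandproperty.co.uk", "financenews.com", "myhouseblog.net"),
   DA = c(55, 72, 18))

lrd_dataset$Category <- "Void" # default bucket for spam / excluded domains
# later rules take precedence when a domain matches more than one pattern
lrd_dataset$Category[lrd_dataset$referring_domain %like% "blog"] <- "Blog"
lrd_dataset$Category[lrd_dataset$referring_domain %like% "news"] <- "News & Current Affairs"
lrd_dataset$Category[lrd_dataset$referring_domain %like% "property" |
                        lrd_dataset$referring_domain %like% "estate"] <- "Property"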

First, I will review the current state of LRDs (linking root domains) for both competitors, Rightmove and OnTheMarket, and then dive deeper into comparisons across the market.

Rightmove LRD Profile

This graphic is interactive. Please hover over the bars to reveal Rightmove’s LRD data points.

  • As expected, the largest chunk of Rightmove’s backlink profile is property-related, with 2,847 LRDs. This group makes up 45% of their backlink profile.
  • Similarly to Zoopla, most of the property domains that point to Rightmove are low in authority, with 2.1k (75%) having a DA of 30 or less. Only 8% of domains fit within the ‘High’ DA group (equal to or higher than 60), which is fitting with our hypothesis that the property industry is congested with spam.
  • The next largest is blogs, with 951 LRDs. Much like Zoopla, the split of authority in the group is healthy, with 32% low, 38% mid and 30% high authority. The group equates to 15% of total LRDs.
  • A close third, News and Current Affairs make up the next biggest chunk of Rightmove’s LRDs, with 881 (14%). This is a group that Rightmove is certainly over-performing in, with the majority of publications (588; 66%) being high in authority.
  • After the news category, the number of LRDs per group starts to decline: Technology & Media (414), Financial & Business Services (286), Education (224), Consumer Goods & Services (203), Gov (168), Travel, Transport & Leisure (116), Directory (108), Energy & Environment (4).

OnTheMarket LRD Profile

This graphic is interactive. Please hover over the bars to reveal OnTheMarket’s LRD data points.

  • The clear standout for OnTheMarket is the high volume of property LRDs, which is their best-performing group.
  • The property category makes up an astonishing 75% of their backlink profile. In total, of the 893 LRDs that make up the group, 86% (771) are low-value domains (below 30 DA). Although the amount of spam in the group is expected, this certainly isn’t a healthy ratio.
  • By comparison, Zoopla and Rightmove’s low-value property domains make up around 75% of their respective group.
  • While their shares of low-value domains are similarly high, it is important to note that the two sites have other groups to, hypothetically speaking, fall back on. The next biggest group in OTM's backlink profile contains 818 fewer LRDs.
  • Compared to Zoopla (678) and Rightmove (881), OTM’s number of News LRDs is a drop in the ocean (75). This is a great example of why the two have more brand awareness – OTM does not get the same coverage as Zoopla or Rightmove.

Top-Level Industry Link Landscape

This graphic is interactive. Please hover over the bars to reveal the top-level LRD data points.

  • The largest deficit in terms of the volume of LRDs is property with 1.2k LRDs.
  • This group is also OnTheMarket's strongest, but they are still 615 LRDs behind us. Rightmove has 2.2k LRDs that Zoopla does not. However, only 431 of those fit within the Mid or High DA groups, meaning the remaining 1.7k (80%) are low-value (DA of 30 or below). A fairly large chunk appears to be companies/agents working in partnership with Rightmove. With that being said, there are still certainly authoritative prospects up for grabs, which we will be adding to our future prospect list.
  • The next largest deficit is the blog category, where we are 262 behind Rightmove. This group is hard to gauge, seeing as there are so many iterations (personal, corporate, general blog, etc.).
  • The blogs that link to Rightmove (and don’t to Zoopla) have a healthy split in DA, with 272 being low, 325 mid and 234 being high in authority.
  • By the looks of things, a big portion of Rightmove’s links in this group has come naturally and from a variety of different blogs (not just property-related).
  • News & Current Affairs is another group that we are being outperformed in (-240 LRDs), which was to be expected. The gaps in this group are mostly mixed between mid and high-DA sites, with notably more ‘global’ publications. 

This article is a member of The Engine Room, the section of the blog that gathers programming and other technically challenging content in one place.

Harry Austen is a Data & Search Analyst. He has worked with the likes of Disney, The Olympics and Zoopla. @austenharry  

This page may change periodically, as and when more information becomes available.



UK Twitter Study: 1.2M Tweet Segmentation Analysis

Twitter is a great tool for gauging public opinion. And during a time that has been testing, to say the least, what better moment to catch up with the Great British public?

The average time spent on Twitter per session is around three and a half minutes, making it a commuting favourite.
Harry Austen via Unsplash

by Harry Austen

Published Jan. 05, 2021 GMT

Updated Mar. 6, 2021 GMT

This article, using unique and original data, dives into the UK's relationship with Twitter. It covers segmentation, the dispersion of activity between Regions, differences in device usage, and other fascinating findings.

Before reviewing the findings, the first chunk of this article is dedicated to explaining the rationale and use case, as well as how to emulate the process and create your own dataset with the Twitter API.

You can find the criteria for each visual directly underneath the corresponding graph. All graphs are interactive, and I encourage you to play around with the Regions on the charts yourself. For a more thorough explanation, see the documentation at the bottom of this page, which contains a longer description of the data requested.

To find out how to start requesting data with the Twitter API, the linked article (above) takes you right from the beginning through to the point of calling the API in R, the language used to request the data. With clear documentation and fairly intuitive syntax, the process should not be as daunting as one might expect. That guide shows the exact steps I took to produce the data presented in this piece.

Although not directly tackled in the article mentioned above, accuracy and consistency were front of mind when collating the data points. Geo-specific targeting can invariably cause issues simply due to the size of the local authorities involved; Regions differ significantly in both area and population. Aside from the first visual (which stands on its own merit), the calculations extrapolate from values that are not the total number of tweets per Region. In total, 116 Regions feature in the study – the full list is in the documentation below.1

While there has been an attempt to colour Regions consistently, there are instances where that has not been possible. 2

Additionally, there has also been an attempt to keep the data as true as possible to its raw form. Given the volume of tweets in the study, I did not see a need to introduce coefficients to weight one Region against another, which would in some way build a bias towards a given local authority. It is with that justification that this dataset is unchanged.


Brief:

In the environment outlined above (and below in the documentation), collate tweets across the United Kingdom to discover social insights and, ultimately, figure out which Region tweets the most.

The agenda for this study is based on the significance of the variable metric and the size of the areas involved. I will be discussing:
  • Top 10 Tweeting Regions by sum of tweets [1]
  • Top 10 Tweeting Regions per capita [2]
  • Highest number of tweets per device [3]

Top 10 UK Tweeting Regions

This first visual is a broad look at the entire UK's relationship with Twitter, as previously outlined – accounting for the total number of tweets per Region.

This graphic is interactive. Please hover over the bars to reveal the Regional data points. This data, recorded over one day (11th Nov. 2020), totals the sum of tweets per Region – regardless of population and density.

The sum of total tweets doesn't tell the whole story. As touched upon previously, the large local authorities, such as London, Manchester and Liverpool, take a large portion, with London claiming the lion's share. It's important to note that London is not represented as a single Region; North, South, West and East London are split individually, knowing the area would otherwise dominate, to ensure no assumed bias was acted on. Notwithstanding comments on expected outcomes, Oxford was certainly a surprise. However, as demonstrated this year during the epidemic, the Region certainly embraces technology, so in a way it does make sense to see it near the top ten. Bristol, the most densely populated region in the South West, is predictable, as, of course, is Birmingham. On balance, Edinburgh, being the most densely populated Scottish settlement, was also predictable. Interestingly, around 208k people live within two miles of Edinburgh City Centre, yet spatially the city holds a staggering amount of land. In other words, although Birmingham hosts roughly twice as many people, the two Regions occupy almost the same amount of land (Edinburgh: 264 km², Birmingham: 267.8 km²).

Top 10 UK Regions, Tweets Per Capita

One step beyond the sum of tweets per Region, and a more accurate representation, is tweets per capita.4 This calculation is essentially my way of adding a dampening factor to level the playing field for Regions that are highly active but don't have the number of residents to compete with a big city. Instead of totalling tweets, this metric accounts for the tweet-to-population ratio (as a percentage).
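The calculation itself is straightforward. Below is a minimal sketch in R with illustrative figures (not the real counts) and assumed column names, purely to show the arithmetic:

# Hypothetical sketch of the tweets-per-capita calculation (figures are illustrative only)
region_totals <- data.frame(
   Region = c("Cambridge", "Manchester", "North London"),
   tweets = c(29000, 113000, 95000),
   population = c(125000, 554000, 870000))

# tweet-to-population ratio, expressed as a percentage
region_totals$tweets_per_capita <- round(region_totals$tweets / region_totals$population * 100, 2)
region_totals[order(-region_totals$tweets_per_capita), ] # rank Regions by per-capita activity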

This graphic is interactive. Please hover over the bars to reveal the Regional data points. This data, recorded over one day (11th Nov. 2020), essentially evens the playing field for small but active local authorities. By calculating per capita, a monopoly on population rightly doesn't assume the biggest chunk of tweets. This is the fairest measurement, as the method ensures we only review tweets that each Region is accountable for.

London, which held a landslide majority (above) when broadly analysing the highest number of tweets, is now 28th. The maths all but guarantees that the capital, with 8.9M inhabitants, would see a huge drop; in this respect, for some of the areas listed, it's almost statistically impossible to compete. Turning attention to the graph above, in the top spot is Cambridge. Fascinatingly, with under 5% of Edinburgh's entire landscape, Cambridge has the most active populace on Twitter. The main factor I can surmise is the huge student footprint: with the area comprising younger individuals, it may help explain the Region's good showing.

Before we touch on their rivals, it's significant to note that Edinburgh, albeit eighth, is still on the board. This suggests it may be unfair to cast Edinburgh in the same false 'big city' theory. It's hard to compare London to frankly any other Region, but to a lesser extent the theory holds true for past top performers Birmingham (13th), Sheffield (14th) and Glasgow (16th). The rate at which a population beyond 500k starts to show diminishing returns is remarkably quick.

To formally conclude, with 23% of the population tweeting during the requesting period, Cambridge is top. Second, and in line with the judgement made about Edinburgh, is Manchester with 20.39%. I have left the Mancunians for the end as, unsurprisingly, the majority of tweets from the Region were related to football. In fact, the official Man United Twitter page was the fifth most tweeted account. Ironically, the former president, Donald Trump, was the most tweeted account. He has since been chucked out of office and off the platform, in what was likely the correct, albeit swiftly terrifying, move. Days after the infamous Capitol disgrace, I wrote about Big Tech's influence on democracy – the debate not hinging on their decision to ban Trump, but on what more seriously lurks beyond an unquestioned grip on democracy. The story, as always, will keep unfolding.

Most Popular Device on Twitter in the UK

This section looks beyond location and focuses on the battle of the brands, with Apple and Android holding the biggest share between them. With that being said, one brand clearly has the upper hand.

This graphic is interactive. Please hover over the bars to reveal the Device data points. This data, recorded over one day (11th Nov. 2020), shows Apple's brand dominance on Twitter. The gap between Android (6,533 tweets) and Apple (525,992) is truly significant.

Often device data can appear misleading or even uninteresting. This is not one of those cases. With Twitter being millions of users' first port of call, the brand battle between Apple and co. matters not just for image but for sales. Instead of either organisation betting on ads to win over potential customers, what better advertisement than 'Send via iPhone' on a never-ending loop?
That's of course an exaggeration, but it's not a million miles away from the truth when digging through the data. Firstly, it's worth noting out of the gate that Apple has a stronghold on coverage across the UK. Incredibly, they lead in 113 out of 116 local authorities; put differently, out of all the Regions collected, there are only three instances in which they don't have a majority share. Interestingly, Oxford and Cambridge, which both clearly have an active interest in Twitter, show a stronger alliance with desktops than smartphones. Echoing the hypothesis mentioned earlier, one would assume this is influenced by their large student populations. Aside from those Regions, Apple is top across the board. During one day of tracking, 525,992 tweets were sent using an iPhone.
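For reference, the device split itself comes from the source field that rtweet attaches to every tweet. A minimal sketch of the tally, with uk_tweets standing in (as an assumed name) for all the regional pulls bound together:

# Hypothetical sketch: tally tweets by client/device across the combined dataset
library(dplyr)

uk_tweets <- data.frame(source = c("Twitter for iPhone", "Twitter for iPhone",
                                   "Twitter for Android", "Twitter Web App"))

uk_tweets %>%
   count(source, sort = TRUE) # largest device/client first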

Documentation

Complete list of Regions 1

1. Aberdeen
2. Bath
3. Birmingham
4. Blackburn
5. Blackpool
6. Bolton
7. Bournemouth
8. Bradford
9. Brighton
10. Bristol
11. Bromley
12. Cambridge
13. Canterbury
14. Cardiff
15. Carlisle
16. Central London
17. Chelmsford
18. Chester
19. Colchester
20. Coventry
21. Crewe
22. Croydon
23. Darlington
24. Dartford
25. Derby
26. Doncaster
27. Dorchester
28. Dudley
29. Dumfries & Galloway
30. Dundee
31. Durham
32. East London
33. Edinburgh
34. Enfield
35. Exeter
36. Falkirk And Stirling
37. Galashiels
38. Glasgow
39. Gloucester
40. Guildford
41. Halifax
42. Harrogate
43. Harrow
44. Hemel Hempstead
45. Hereford
46. Huddersfield
47. Hull
48. Ilford
49. Inverness
50. Ipswich
51. Kilmarnock
52. Kingston Upon Thames
53. Kirkcaldy
54. Kirkwall
55. Lancaster
56. Leicester
57. Lerwick
58. Lincoln
59. Liverpool
60. Llandrindod Wells
61. Llandudno
62. Luton
63. Manchester
64. Milton Keynes
65. Motherwell
66. Newcastle Upon Tyne
67. Newport
68. North London
69. North West London
70. Northampton
71. Northern Ireland
72. Norwich
73. Nottingham
74. Oldham
75. Outer Hebrides
76. Oxford
77. Paisley
78. Perth
79. Peterborough
80. Plymouth
81. Portsmouth
82. Preston
83. Reading
84. Redhill
85. Rochester
86. Romford
87. Salisbury
88. Sheffield
89. Shrewsbury
90. Slough
91. South East London
92. South West London
93. Southall
94. Southampton
95. Southend-On-Sea
96. St Albans
97. Stevenage
98. Stockport
99. Sunderland
100. Sutton
101. Swindon
102. Taunton
103. Telford
104. Tonbridge
105. Torquay
106. Truro
107. Twickenham
108. Wakefield
109. Walsall
110. Warrington
111. Watford
112. West London
113. Wigan
114. Wolverhampton
115. Worcester
116. York

Regional Colour Scheme 2

Due to the software limitation of the number of bars per graph, there are instances of non-conforming Regional colour schemes. Unfortunately, this is unavoidable.

Output 3


Tweets Per Capita 4

When comparing the sum of tweets against the number of people in each Region (per capita), otherwise-missed connections start to form. Without accounting for the regional parameters (pros and cons), you aren't going to find anything that hasn't already been reported or understood.

This article is a member of The Engine Room, the section of the blog that gathers programming and other technically challenging content in one place.

Harry Austen is a Data & Search Analyst. He has worked with the likes of Disney, The Olympics and Zoopla. @austenharry  

This page may change periodically, as and when more information becomes available.

Categories
The Engine Room


Text Analysis R: Keyword Attribution Model

This is a guide to identifying keyword groups and audiences that align better with your content and user intent. Equally, this methodology ensures a focus on attainable keywords and helps in discovering otherwise missed keyword clusters. With a logical approach, you can explain your keyword process to both clients and stakeholders with more clarity.

KUALA LUMPUR, Malaysia — Federal Territory of Kuala Lumpur, Malaysia, pictured in all its colourful, segmented glory.
Saloma Link/Unsplash

by Harry Austen

Published Dec. 07, 2020 GMT

Updated Mar. 6, 2021 GMT

To quickly download the files used in the BrightonSEO talk you have watched (or are currently watching), please use the links below. If you face issues, right-click the link and select “Save as.”

Keyword group database: click here.
The batch of keywords used in the talk: click here.
R-Studio and R(language): click here and here.

R and R-Studio run on both Mac and Windows and are readily available to download (for free) via the active links above. You will find the full script and tutorial at the bottom of this page.

The functions used in R are described in the BrightonSEO video. If you haven't watched it, or are still stuck, I would recommend breaking down the packages one by one and searching online for definitions. Equally, if you scroll beyond this intro section, I detail the steps and explain most of the script's functions.

If you didn't attend BrightonSEO, you can purchase the full video via the HeySummit video library.

Below you will find the steps accompanying the conference talk.

Table of Contents

Introduction to Keyword Attribution

Generally speaking, most keyword research is done in a PPC mould, with cost per click and metrics of that nature featured in tools built for the opposite (paid) search scenario. The likes of Keyword Planner help to highlight this. The service, which has seen upgrades now and then, is the tool most folks in Search use for keyword analysis. Relying on it alone is unwise: it can produce misleading search volumes and, on the whole, lead to short-term thinking in the hope of a quick win.

With attribution, you can identify audiences and keyword groups that better align with both content and user intent. With a real focus on attainable keywords, and by drawing your attention to keyword clusters you may otherwise have missed, this is a logical approach that you can easily show to both clients and stakeholders to explain your reasoning.

In line with the video tutorial, I will be assuming that you already have R installed and are ready to start working on the script to manipulate the downloadable batch of keywords (as CSV, above). The keyword database that will direct the groups used for filtering is in the same section.

The premise for the analysis is that we are working with a client, a clothing retailer, who wants to find more relevant keywords that they may not have considered targeting.

Install & Library Packages

To manipulate data with the function used in this script, you will need to install and library (save to R’s memory) the following packages.

# Install Packages
install.packages("dplyr")
install.packages("data.table")
install.packages("readr")
# Library Packages
library(dplyr)
library(data.table)
library(readr) # reading in the csv with search volume needing to be turned into characters

Import Keyword Dataset (CSV)

As noted above, you can download the document used during the BrightonSEO talk. If you haven't watched the video, the document is a raw sample containing clothing keywords, exported from Keyword Planner.

# Import CSV and save as 'keyword dataset' dataframe
keyword_dataset <- read.csv("keyword_dataset.csv", stringsAsFactors = FALSE)

Data Preprocessing, Subsetting

Essentially, the process used to extract specific keywords, known as ‘text analysis’, requires an exact match. In other words, you need to have a list of keywords that are the same, both in spelling and format, as the database of keywords you are searching in.

# Data Preprocessing
keyword_dataset$Keyword <- tolower(keyword_dataset$Keyword) # Housekeeping - make all keywords lowercase so formatting is consistent (avoids missing keywords)
# Remove unnecessary Keyword Planner metrics
keyword_dataset <- subset(keyword_dataset, select = -c(CPC, Previous.position, Number.of.Results, Trends, SERP.Features.by.Keyword, URL))

It is always best to ensure there are no discrepancies between a) the list of keywords and b) the exported dataset. To resolve any potential formatting errors, I use the 'tolower' function, which makes all keywords in the dataset lowercase. It does mean that the list of keywords you filter with later must also be in lowercase. However, it is an easy solution and ensures you do not have to worry about double-checking for stray capital letters in what may be a list of thousands of keywords.

While this may vary depending on where you generated your raw file of keywords, Keyword Planner will often provide metrics you don't need. Using the 'subset' function, you can remove them.

Example of Filtering Keywords

Below is an example of how to filter for keywords. Essentially, the function works by first selecting the dataframe you wish to query [keyword_dataset], followed by the column name prefixed with a dollar sign [$Keyword], followed by the 'keyword' to target, which you indicate via the %like% operator.

I have included two examples, which show what you will eventually pull off at a larger scale (with more keywords).

# filter exactly for a value (keyword) - within the KEYWORD column($) - like (%like%) 'SHOES'
shoes_keyword_dataset <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'shoes')

# filter exactly for a value (keyword) - within the KEYWORD column($) - like (%like%) 'CLOTHING'
clothing_keyword_dataset <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'clothing')

OR Logical Operator

Based on the logic above, you should now understand how to filter the raw file with single keywords. When it comes to filtering for multiple keywords, you need to add the OR operator [|]. Because R reads the expression from left to right, you indicate at the end of each condition that another value (keyword) follows. With the operator added, R evaluates every condition and collates the 'hits' into the dataframe; without it, R would stop after the first condition. If you have ever used JavaScript, you will be familiar with the functionality, which is almost identical (||) to the one used in R.
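As a quick illustration (the terms here are hypothetical, not taken from the keyword database), two %like% conditions joined with the OR operator look like this:

# Filter for keywords containing 'shoes' OR 'boots' - note the | at the end of the first condition
footwear_keywords <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'shoes' |
   keyword_dataset$Keyword %like% 'boots')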

Note on Groups, First Filter

It is important to note that while the keyword database will be relevant for most, some groups may not apply to you. In that case, please omit those sections from your script and only include the relevant passages. As directed by the keyword database (download above), the first group covers question-based keywords.

# Question-based Keywords
question_keywords <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'how to' |
   keyword_dataset$Keyword %like% 'what' |
   keyword_dataset$Keyword %like% 'how' |
   keyword_dataset$Keyword %like% 'when' |
   keyword_dataset$Keyword %like% 'who' |
   keyword_dataset$Keyword %like% 'where' |
   keyword_dataset$Keyword %like% 'why ' |
   keyword_dataset$Keyword %like% 'how many' |
   keyword_dataset$Keyword %like% 'near me')

Remaining Groups to Filter

The following snippet contains the remaining groups. It is long, so scroll within it (or copy it out) to see the expanded code.


# General / Clothing
general_clothing <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'outfits' |
             keyword_dataset$Keyword %like% 'overshirt' |
             keyword_dataset$Keyword %like% 'jacket' |
             keyword_dataset$Keyword %like% 'party' |
             keyword_dataset$Keyword %like% 'sleeve' |
             keyword_dataset$Keyword %like% 'socks' |
             keyword_dataset$Keyword %like% 'cashmere' |
             keyword_dataset$Keyword %like% 'cardigan' |
             keyword_dataset$Keyword %like% 'sweatshirt' |
             keyword_dataset$Keyword %like% 'sweater' |
             keyword_dataset$Keyword %like% 'jumper' |
             keyword_dataset$Keyword %like% 'shirt' |
             keyword_dataset$Keyword %like% 'dress' |
             keyword_dataset$Keyword %like% 'flannel' |
             keyword_dataset$Keyword %like% 'crocs' |
             keyword_dataset$Keyword %like% 'suits' |
             keyword_dataset$Keyword %like% 'tee' |
             keyword_dataset$Keyword %like% 'leggings' |
             keyword_dataset$Keyword %like% 'palazzo pants' |
             keyword_dataset$Keyword %like% 'jeans' |
             keyword_dataset$Keyword %like% 'polo' |
             keyword_dataset$Keyword %like% 'tracksuit' |
             keyword_dataset$Keyword %like% 'jumpers' |
             keyword_dataset$Keyword %like% 'tie' |
             keyword_dataset$Keyword %like% 'trainers' |
             keyword_dataset$Keyword %like% 'shorts' |
             keyword_dataset$Keyword %like% 'yoga' |
             keyword_dataset$Keyword %like% 'gear' |
             keyword_dataset$Keyword %like% 'cleat' |
             keyword_dataset$Keyword %like% 'jersey' |
             keyword_dataset$Keyword %like% 'merch' |
             keyword_dataset$Keyword %like% 't shirts' |
             keyword_dataset$Keyword %like% 'workwear' |
             keyword_dataset$Keyword %like% 'tank top' |
             keyword_dataset$Keyword %like% 'top' |
             keyword_dataset$Keyword %like% 'pants' |
             keyword_dataset$Keyword %like% 'vans' |
             keyword_dataset$Keyword %like% 'sportswear' |
             keyword_dataset$Keyword %like% 'fleece' |
             keyword_dataset$Keyword %like% 'undergarments' |
             keyword_dataset$Keyword %like% 'shirts' |
             keyword_dataset$Keyword %like% 'shirt' |
             keyword_dataset$Keyword %like% 'coat' |
             keyword_dataset$Keyword %like% 'wenven' |
             keyword_dataset$Keyword %like% 'knitwear')

general_clothing$Group <- "General Clothing"

# Female / Clothing
female_clothing <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'womens trousers' |
             keyword_dataset$Keyword %like% 'swimwear' |
             keyword_dataset$Keyword %like% 'womens swimwear' |
             keyword_dataset$Keyword %like% 'hoodies' |
             keyword_dataset$Keyword %like% 'shorts & skirts' |
             keyword_dataset$Keyword %like% 'womens outfits' |
             keyword_dataset$Keyword %like% 'mens multipacks' |
             keyword_dataset$Keyword %like% 'womens womens swimwear' |
             keyword_dataset$Keyword %like% 'womens hoodies' |
             keyword_dataset$Keyword %like% 'womens shorts & skirts' |
             keyword_dataset$Keyword %like% 'womens womens outfits' |
             keyword_dataset$Keyword %like% 'womens mens multipacks' |
             keyword_dataset$Keyword %like% 'womens pyjamas' |
             keyword_dataset$Keyword %like% 'womens pyjama tops' |
             keyword_dataset$Keyword %like% 'womens pyjama bottoms' |
             keyword_dataset$Keyword %like% 'womens pyjama sets' |
             keyword_dataset$Keyword %like% 'womens dressing gowns' |
             keyword_dataset$Keyword %like% 'womens nightdresses' |
             keyword_dataset$Keyword %like% 'womens slippers' |
             keyword_dataset$Keyword %like% 'womens character' |
             keyword_dataset$Keyword %like% 'womens short pyjamas' |
             keyword_dataset$Keyword %like% 'womens bride & hen' |
             keyword_dataset$Keyword %like% 'womens loungewear' |
             keyword_dataset$Keyword %like% 'womens womens joggers' |
             keyword_dataset$Keyword %like% 'womens overshirt' |
             keyword_dataset$Keyword %like% 'womens jacket' |
             keyword_dataset$Keyword %like% 'womens workwear' |
             keyword_dataset$Keyword %like% 'womens fleece' |
             keyword_dataset$Keyword %like% 'womens undergarments')

female_clothing$Group <- "Female Clothing"



# General / Brand
general_brand <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'nike' |
             keyword_dataset$Keyword %like% 'air force one' |
             keyword_dataset$Keyword %like% 'air max' |
             keyword_dataset$Keyword %like% 'infinity' |
             keyword_dataset$Keyword %like% 'react ' |
             keyword_dataset$Keyword %like% 'mercurial' |
             keyword_dataset$Keyword %like% 'revolution' |
             keyword_dataset$Keyword %like% 'max' |
             keyword_dataset$Keyword %like% 'todos' |
             keyword_dataset$Keyword %like% 'lebron' |
             keyword_dataset$Keyword %like% 'pegasus' |
             keyword_dataset$Keyword %like% 'air zoom' |
             keyword_dataset$Keyword %like% 'blazer' |
             keyword_dataset$Keyword %like% 'vista lite' |
             keyword_dataset$Keyword %like% 'max bella' |
             keyword_dataset$Keyword %like% 'adidas' |
             keyword_dataset$Keyword %like% 'jordon' |
             keyword_dataset$Keyword %like% 'converse' |
             keyword_dataset$Keyword %like% 'orolay' |
             keyword_dataset$Keyword %like% 'yuedge' |
             keyword_dataset$Keyword %like% 'yeezy' |
             keyword_dataset$Keyword %like% 'boost 350 ' |
             keyword_dataset$Keyword %like% 'wave runner' |
             keyword_dataset$Keyword %like% 'woodstock' |
             keyword_dataset$Keyword %like% 'vans' |
             keyword_dataset$Keyword %like% 'arcteryx ' |
             keyword_dataset$Keyword %like% 'big and tall' |
             keyword_dataset$Keyword %like% 'bjj gi' |
             keyword_dataset$Keyword %like% 'black and white')


general_brand$Group <- "General Brand"


# General / Transaction
general_transaction <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'buy' |
             keyword_dataset$Keyword %like% 'purchase' |
             keyword_dataset$Keyword %like% 'sell' |
             keyword_dataset$Keyword %like% 'transaction' |
             keyword_dataset$Keyword %like% 'merchant' |
             keyword_dataset$Keyword %like% 'shop' |
             keyword_dataset$Keyword %like% 'sale' |
             keyword_dataset$Keyword %like% 'promo' |
             keyword_dataset$Keyword %like% 'clearance')

general_transaction$Group <- "General Transaction"



# Female / Footwear
female_footware <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'womens boots' |
             keyword_dataset$Keyword %like% 'ankle boots' |
             keyword_dataset$Keyword %like% 'womens trainers' |
             keyword_dataset$Keyword %like% 'womens flats' |
             keyword_dataset$Keyword %like% 'heels' |
             keyword_dataset$Keyword %like% 'womens sandals ' |
             keyword_dataset$Keyword %like% 'ballet shoes' |
             keyword_dataset$Keyword %like% 'leather' |
             keyword_dataset$Keyword %like% 'slippers' |
             keyword_dataset$Keyword %like% 'wellies' |
             keyword_dataset$Keyword %like% 'flip flops' |
             keyword_dataset$Keyword %like% 'pumps' |
             keyword_dataset$Keyword %like% 'suede' |
             keyword_dataset$Keyword %like% 'womens boots' |
             keyword_dataset$Keyword %like% 'womens ankle boots' |
             keyword_dataset$Keyword %like% 'womens trainers' |
             keyword_dataset$Keyword %like% 'womens flats' |
             keyword_dataset$Keyword %like% 'womens heels & wedges' |
             keyword_dataset$Keyword %like% 'womens sandals' |
             keyword_dataset$Keyword %like% 'womens ballet shoes' |
             keyword_dataset$Keyword %like% 'womens leather' |
             keyword_dataset$Keyword %like% 'womens slippers' |
             keyword_dataset$Keyword %like% 'womens wellies' |
             keyword_dataset$Keyword %like% 'womens flip flops' |
             keyword_dataset$Keyword %like% 'womens pumps' |
             keyword_dataset$Keyword %like% 'womens suede' |
             keyword_dataset$Keyword %like% 'womens shoes' |
             keyword_dataset$Keyword %like% 'womens cleats')


female_footware$Group <- "Female Footwear"

# Female / Nightwear
female_nightwear <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'womens joggers' |
             keyword_dataset$Keyword %like% 'womens pyjamas' |
             keyword_dataset$Keyword %like% 'womens pyjama tops' |
             keyword_dataset$Keyword %like% 'womens pyjama bottoms' |
             keyword_dataset$Keyword %like% 'womens pyjama sets' |
             keyword_dataset$Keyword %like% 'womens dressing gowns' |
             keyword_dataset$Keyword %like% 'womens nightdresses' |
             keyword_dataset$Keyword %like% 'womens slippers' |
             keyword_dataset$Keyword %like% 'womens character' |
             keyword_dataset$Keyword %like% 'womens short pyjamas' |
             keyword_dataset$Keyword %like% 'womens bride & hen' |
             keyword_dataset$Keyword %like% 'womens loungewear' |
             keyword_dataset$Keyword %like% 'womens womens joggers')

   
female_nightwear$Group <- "Female Nightwear"



# Female / Lingerie
female_lingerie <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'bras' |
             keyword_dataset$Keyword %like% 'knickers' |
             keyword_dataset$Keyword %like% 'lingerie sets' |
             keyword_dataset$Keyword %like% 'maternity lingerie' |
             keyword_dataset$Keyword %like% 'entice' |
             keyword_dataset$Keyword %like% 'nude lingerie' |
             keyword_dataset$Keyword %like% 'shapewear' |
             keyword_dataset$Keyword %like% 'bodysuits' |
             keyword_dataset$Keyword %like% 'womens bras' |
             keyword_dataset$Keyword %like% 'womens knickers' |
             keyword_dataset$Keyword %like% 'womens lingerie sets' |
             keyword_dataset$Keyword %like% 'womens maternity lingerie' |
             keyword_dataset$Keyword %like% 'womens entice' |
             keyword_dataset$Keyword %like% 'womens nude lingerie' |
             keyword_dataset$Keyword %like% 'womens shapewear' |
             keyword_dataset$Keyword %like% 'womens bodysuits')
   
female_lingerie$Group <- "Female Lingerie"


# Female / Maternity
female_maternity <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'maternity bottoms' |
             keyword_dataset$Keyword %like% 'maternity coats' |
             keyword_dataset$Keyword %like% 'maternity dresses' |
             keyword_dataset$Keyword %like% 'maternity jeans' |
             keyword_dataset$Keyword %like% 'maternity leggings' |
             keyword_dataset$Keyword %like% 'maternity lingerie' |
             keyword_dataset$Keyword %like% 'maternity multipacks' |
             keyword_dataset$Keyword %like% 'maternity nightwear' |
             keyword_dataset$Keyword %like% 'maternity swimwear' |
             keyword_dataset$Keyword %like% 'maternity tops' |
             keyword_dataset$Keyword %like% 'nursing clothes' |
             keyword_dataset$Keyword %like% 'mamalicious clothes' |
             keyword_dataset$Keyword %like% 'womens shorts' |
             keyword_dataset$Keyword %like% 'womens skirts' |
             keyword_dataset$Keyword %like% 'womens multipacks' |
             keyword_dataset$Keyword %like% 'womenswear' |
             keyword_dataset$Keyword %like% 'mensware')

female_maternity$Group <- "Female Maternity"

# Male / Footwear
male_footwear <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'mens trainers' |
             keyword_dataset$Keyword %like% 'mens boots ' |
             keyword_dataset$Keyword %like% 'mens formal shoes' |
             keyword_dataset$Keyword %like% 'mens leather & suede' |
             keyword_dataset$Keyword %like% 'mens slippers' |
             keyword_dataset$Keyword %like% 'mens sneaker' |
             keyword_dataset$Keyword %like% 'sneaker' |
             keyword_dataset$Keyword %like% 'mens shoes' |
             keyword_dataset$Keyword %like% 'pumps' |
             keyword_dataset$Keyword %like% 'wellies' |
             keyword_dataset$Keyword %like% 'mens cleats')

      
male_footwear$Group <- "Male Footwear"


# Male / Nightware
male_nightwear <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 'mens joggers' |
             keyword_dataset$Keyword %like% 'mens pyjamas' |
             keyword_dataset$Keyword %like% 'mens pyjama tops' |
             keyword_dataset$Keyword %like% 'mens pyjama bottoms' |
             keyword_dataset$Keyword %like% 'mens pyjama sets' |
             keyword_dataset$Keyword %like% 'mens dressing gowns' |
             keyword_dataset$Keyword %like% 'mens nightdresses' |
             keyword_dataset$Keyword %like% 'mens slippers' |
             keyword_dataset$Keyword %like% 'mens character' |
             keyword_dataset$Keyword %like% 'mens short pyjamas' |
             keyword_dataset$Keyword %like% 'mens bride & hen' |
             keyword_dataset$Keyword %like% 'mens loungewear' |
             keyword_dataset$Keyword %like% 'mens joggers')
   

male_nightwear$Group <- "Male Nightwear"



# Male / Clothing
male_clothing <- keyword_dataset %>%
   filter(keyword_dataset$Keyword %like% 't-shirts' |
             keyword_dataset$Keyword %like% 'long sleeve' |
             keyword_dataset$Keyword %like% 'polo shirts' |
             keyword_dataset$Keyword %like% 'mens loungewear' |
             keyword_dataset$Keyword %like% 'mens t-shirts' |
             keyword_dataset$Keyword %like% 'mens shirts' |
             keyword_dataset$Keyword %like% 'underwear' |
             keyword_dataset$Keyword %like% 'socks' |
             keyword_dataset$Keyword %like% 'mens joggers' |
             keyword_dataset$Keyword %like% 'mens jeans' |
             keyword_dataset$Keyword %like% 'mens trousers' |
             keyword_dataset$Keyword %like% 'mens hoodies' |
             keyword_dataset$Keyword %like% 'mens sweatshirts' |
             keyword_dataset$Keyword %like% 'mens outfits' |
             keyword_dataset$Keyword %like% 'mens jumpers' |
             keyword_dataset$Keyword %like% 'mens long sleeve' |
             keyword_dataset$Keyword %like% 'mens polo shirts' |
             keyword_dataset$Keyword %like% 'mens underwear' |
             keyword_dataset$Keyword %like% 'mens socks' |
             keyword_dataset$Keyword %like% 'mens workwear' |
             keyword_dataset$Keyword %like% 'mens sportswear' |
             keyword_dataset$Keyword %like% 'mens coat')


male_clothing$Group <- "Male Clothing"

Rbind, Export Final Dataset

The final step is to combine all the new dataframes you have created, using the 'rbind' function. All you need to do is add the names of the dataframes, separated by commas. Then, once combined, export the dataframe as a CSV.

combined_dataset <- rbind(female_clothing, general_clothing, general_brand, general_transaction, 
                     female_footware, female_nightwear, female_lingerie, female_maternity, 
                     male_footwear, male_nightwear, male_clothing)
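
The export itself isn't shown above; a minimal version, assuming you are happy writing the file to your working directory under an arbitrary name, would be:

# Export the combined dataframe as a CSV to the working directory
write.csv(combined_dataset, "combined_dataset.csv", row.names = FALSE)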

Conclusion: Text Analysis R

As you can see, once you have decided on the keyword categories and built out a list to query, the process of filtering and grouping is relatively simple. I would recommend downloading the attached files if this is your first time using the approach. Beyond that, once you feel comfortable, test and see what works (i.e. what generates more 'hits' against the corresponding database). It would be unjust not to mention Merge Words, a free tool that makes concatenating keyword strings a much simpler affair; I would recommend using it when you start building out your own version.


This article is a member of The Engine Room, the section of the blog that gathers programming and other technically challenging content in one place.

Harry Austen is a Data & Search Analyst. He has worked with the likes of Disney, The Olympics and Zoopla. @austenharry  

This page may change periodically, as and when more information becomes available.



Twitter API: Set up, authorisation and search tweet requests with rtweet

We often rely on social data to serve as an account of the 'public' response, with Twitter being one of the strongest indicators. This means, if you want to get a handle on what folks are talking about the most, you need to get your hands on the Twitter API.

The Twitter API is a great way to source social datasets; with an easy setup and an intuitive programming language, you can pull insights in no time.
Harry Austen via Unsplash

by Harry Austen

Published Nov. 22, 2020 GMT

Updated Mar. 6, 2021 GMT

To accommodate those in a rush (or returning), I have added a table of contents below to skip to the relevant section(s) in this article. If you have just landed on this page and have little to no experience, I would encourage you to read the piece in full rather than skip ahead. This way, you will make fewer errors and spend less time referring back to this article. When it comes to writing and understanding syntax, particularly of this nature, you will make mistakes – that's the idea. However, reading and comprehending the text first is a surefire way of minimising those errors.

Equally, below you will find the full script. Similarly to the above, if this is your first time on this page, I would recommend reading the full piece before running the code.

Table of Contents

# Install & library packages
install.packages("rtweet") # this is the main package we will be using to execute the pull
install.packages("purrr") # helps R functionality
install.packages("ggplot2") # not in use for this, but good know for plotting visual data 
install.packages("httr") # GET rquests
install.packages("dplyr") # cleansing data
install.packages("data.table") # building tables
library(rtweet) #
library(purrr) 
library(httr) 
library(httr) 
library(dplyr)
library(data.table)

# Function to shortcut installation (if not already installed)
if (!requireNamespace("remotes", quietly = TRUE)) { # install remotes package if it's not already 
   install.packages("remotes")}


# Twitter Authentication
api_key <- "[Insert API key]" 
api_secret_key <- "[Insert API secret key]"
access_token <- "[Insert access token]"
access_token_secret <- "[Insert access token secret]"

# Create Token
token <- create_token(   ## authenticate Twitter APIs via web browser
   app = "14/11/2020",
   consumer_key = api_key,
   consumer_secret = api_secret_key,
   access_token = access_token,
   access_secret = access_token_secret)

# Fetch tokens
token <- get_tokens() # test authentication
token # print token, should list <oauth_app> ['Name of App']


# Pulling data for 2020-11-11
s <- as.Date(Sys.Date() - 13) # start date: x days ago (11 Nov 2020 at the time of the pull)
e <- as.Date(Sys.Date() - 12) # end date: the day after the start, giving a single-day window

leicester_tweets<- search_tweets(q= "", n=5000000, type = "recent", geocode = "52.6442,-1.14862,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
bristol_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4489,-2.62233,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
peterborough_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.6268,-0.0450012,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
south_west_london_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4641,-0.167499,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
edinburgh_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "55.9265,-3.22776,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
liverpool_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.4159,-2.96287,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
north_london_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.5853,-0.114121,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
portsmouth_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "50.8021,-1.03495,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
coventry_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.367,-1.50119,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
brighton_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "50.8289,-0.119607,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
reading_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4046,-1.02755,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
norwich_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.6461,1.34025,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
derby_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.9331,-1.49156,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
birmingham_tweets <- search_tweets(q= "", n=100000, type = "recent", geocode = "52.4651,-1.88866,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
manchester_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.4787,-2.27173,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
wigan_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.5344,-2.63743,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
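
# As an aside: rather than copy-pasting one search_tweets call per Region, the same pulls can be
# expressed as a loop over a named vector of geocodes. This is a sketch of an optional refactor,
# not the script used for the study; the coordinates are taken from the calls above.
geocodes <- c(leicester = "52.6442,-1.14862,5mi",
              bristol   = "51.4489,-2.62233,5mi",
              edinburgh = "55.9265,-3.22776,5mi")

all_tweets <- lapply(geocodes, function(g) {
   search_tweets(q = "", n = 100000, type = "recent", geocode = g, parse = TRUE,
                 token = token, retryonratelimit = TRUE, verbose = TRUE,
                 since = s, until = e)})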

Twitter API Application

As you can imagine, Twitter, much like other social media companies, really likes to know what you will be doing with their data. It makes sense, as there could clearly be harm done by offering up the API to anyone. Notwithstanding hackers and people of that nature, Twitter seems to exercise a larger degree of control over access to this service than most. In terms of applying, first, you must fill out a form detailing how you plan to use the API – what audience will the coverage of tweets reach, and is this a project-based initiative or a new client offering? Then, to get access to the API, you will need to set up a developer account. The next stage of the process involves filling out another form with clear, detailed and accurate answers to Twitter's questions. The questions aren't an attempt to catch you out; it is best to be as specific as possible. Twitter is only looking to understand what data you plan to request. As long as it's above board, you should be fine.

Fortunately, my application took a few days. For others, the process from application to running the API can be a fairly tortuous one. One last thing to note, which I haven't seen discussed and may not be a deciding factor, is that I applied with a personal Twitter account. I'm not sure if this added credibility, but it's worth attempting if you get denied the first time around with a business-branded account. Generally, according to forum submissions, acceptance can take anywhere from a few days to a few months. Therefore, if you plan to use the API for a project or new business initiative, it would be best to apply as soon as possible.

Set Working Directory & Edit Theme

I'm going to assume you have R installed and potentially already have a metric or project in mind that you are looking to investigate more thoroughly with the API. If not, please decide on one before jumping ahead to the library installations below. This could be anything, even an ultra-specific data point like the number of people tweeting about a football game in the UK. Ultimately, it really doesn't matter; but to avoid procrastinating, pick a topic that you have an interest in and stick with it. By the time you finish this run-through, if you stick with the same theme, you should end up with a great dataset that you can analyse to pull out tangible, fascinating insights. Not having a project or theme generally means you will test out the package, find a few results and perhaps lose interest. With a mission to end up with a full dataset, you will commit yourself to requesting more calls and will naturally learn more from pulling apart the batch of information you end up with.

Your set-up will look slightly different from the one I'm sharing. I have arranged my layout differently (after years of figuring out what works for me) and changed the default theme. These are very easy to change and can be done at any time. The default layout is perfectly fine to start off with, and I actually wouldn't recommend switching to the unorthodox set-up I have. If you really think you need to, or would like to know for future reference: [Tools > Global Options].

To change the layout, head over to ‘Code’ in the left-hand panel of the Global Options. You may be overwhelmed by the amount of choice, so if you feel more comfortable directly copying what you can see on this page, the ‘Editing’ options I use are shown below. The only extra change, aside from the theme (below), is highlighting the selected line. Again, these style options were settled on after years of use, so please don't feel compelled to use what's worked for me. If you would like to highlight the selected line, you can find the tickable option in the ‘Display’ tab.

Twitter API Theme Editing in R
RStudio offers a wide range of options when it comes to themes and customization.

To change the design theme, colour, editor font and size, click on the third tab below ‘Code’. I won't be previewing this section as there's a whole roster of themes to play around with. I would advise beginners to choose highly contrasting elements; in other words, make sure the background and text colours are strikingly different. That way, you won't miss lettering or important functions by mistake in the script. Generally speaking, the default theme should be perfectly fine.

With R (the language) and RStudio (editor and console) installed and a theme decided, there are a handful of prerequisite packages you must install in order to use ‘rtweet’ (the primary package used throughout this tutorial). As is common practice, you should create a new folder (or use an existing one) to store all R-related files. These files need to be easily accessible, in the sense of having a clear file path – for example, on the desktop (preferably in the cloud, if that option is available). The reason is that when you start to import and export data, you have a secure ‘home’ for all related files. Equally, it means you will not lose a file the operation depends on.

Once that folder has been created (empty at this point, if this is your first time using R), note down its file path. Then in RStudio: [Session > Set Working Directory > Choose Directory].

Set Working Directory in R - Twitter API
Although a simple step in the process, setting your working directory is more crucial than you may originally think.

Generally, I like to get this sorted right at the start of my session. Otherwise, especially if you have multiple environments open at once, you can lose track of which folder you are plugged into. It also helps form the habit of setting the working directory, saving time down the line.
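
If you prefer to do this from the console rather than the menu, the same step can be done in code. A minimal sketch, assuming a hypothetical folder called ‘r-twitter’ on the desktop:

setwd("~/Desktop/r-twitter") # point R at the project folder (example path)
getwd() # confirm the working directory has been set correctly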

Install rtweet & Library Packages

Now that you have set your working directory, you can start to install packages. Again, most of these packages are prerequisites; they contain helpful functions that will make for a smoother pull of Twitter data. In R, as in many other languages, everything after a hash symbol (‘#’) is treated as a comment and is not executed. In other words, as R reads through the script, top to bottom, whenever it meets the ‘#’ symbol it disregards everything that follows until it reaches the next line. Next to the packages below, I have commented (‘#’) the basic use of each package to be installed. For instance, the “httr” package allows you to make API GET requests – in basic terms, it makes the connection from a simple line of code that points at the API address.

install.packages("rtweet") # this is the main package we will be using to execute the pull
install.packages("purrr") # helpers for working with lists and vectors
install.packages("ggplot2") # not in use for this, but good to know for plotting visual data
install.packages("httr") # GET requests
install.packages("dplyr") # cleansing data
install.packages("data.table") # building tables

Once they have all been installed, you will need to ‘library’ them to load them into R's working memory. In other words, by librarying the packages, R knows which ones it will need to have on the shelf, ready to refer to at any time during the session.

library(rtweet)
library(purrr)
library(httr)
library(dplyr)
library(data.table)

This section of code will need to be executed every time you start a new session, not just after installing the packages. This is because the packages must be libraried before moving on to the next stage of code.

Installation Shortcut

In R, if you leave the application, you may find a package you rely on isn't installed on the machine you come back to. To make things easy, I often use the shortcut below, which checks whether a package is already installed and only installs it if not.

# Function to shortcut installation (if not already installed)
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

Note: you don’t need to add this step to your script; it’s simply a shortcut to automate the installation of packages.
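
The same check can be applied to every package used in this tutorial. Below is a minimal sketch of my own (a generic R pattern, not something provided by ‘rtweet’) that installs anything missing and then loads it:

# Install any required packages that are missing, then load them all
pkgs <- c("rtweet", "purrr", "ggplot2", "httr", "dplyr", "data.table")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p) # only runs if the package is not yet installed
  }
  library(p, character.only = TRUE) # load the package into the session
}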

Twitter API Authentication

Once you have installed and added all the packages to R's working memory (library), you now have to confirm your Twitter credentials. This means authenticating your Twitter developer account. To do so, head over to Twitter's Developer site. To generate the authentication, you need to create a new ‘Standalone App’, which you can find under the ‘Projects & Apps’ tab.

Twitter API Developer Portal
Although a little intimidating at first, Twitter's API documentation is laid out incredibly well.

Your account will look different from the one above, as I have already created a handful of projects. You have a cap of 10 projects per account. To create a new project, scroll to the bottom of this page and click: [+ Create App].

Create Twitter API App
Always name your app sensibly, avoiding names too similar to your other apps.

You will now find yourself on this page and will need to enter a name for your project. I would advise you to pick a project name that clearly describes your scheme. Once you’ve picked a name, you will be redirected to this page (below), which displays all your keys and token information. Note: These details are no longer active.

Twitter API Keys
These details are no longer current; it is not advised to share Twitter API keys, tokens or other unique codes of that nature.

Please note these down in a separate document, as they will be crucial to the next few stages of this tutorial. Again, these details have been regenerated and are no longer active. These credentials should be exclusive to one individual and not shared with anyone other than the account holder; I have shown them here for educational purposes and have since ensured the keys and token are deprecated.

api_key <- "[Insert API key]"
api_secret_key <- "[Insert API secret key]"
access_token <- "[Insert access token]"
access_token_secret <- "[Insert access token secret]"

You can find these details when you first create a new app (as in the case above). Equally, if you haven't noted them down, you can also find them under the ‘Projects & Apps’ tab by clicking on the key icon next to the relevant project.

Find Twitter API Keys and Token
It's worth noting that generally Twitter will only allow one user access at a time on the Developer dashboard.

You should keep more than one copy of these details. If you are working on a team project, this is particularly important, as it means you will not have to keep referring back to this page.
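
As an aside (my own suggestion rather than part of this walkthrough), one way to avoid hard-coding keys in a script is to store them in your ~/.Renviron file and read them with Sys.getenv(). The variable names below are hypothetical:

# Assumes lines such as TWITTER_API_KEY=xxxx have been added to ~/.Renviron
api_key <- Sys.getenv("TWITTER_API_KEY")
api_secret_key <- Sys.getenv("TWITTER_API_SECRET_KEY")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_token_secret <- Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")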

Create Unique Token

Now that you've done the hard work of finding your account details and assigning them (<-) to named variables (e.g. ‘api_key’), you can call those elements simply by referencing the variables. In other words, you have created individual containers that hold the important keys, so you do not have to type out the full keys every time you need them.

token <- create_token(
  app = "14/11/2020",
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)

The only part of this script you will need to enter manually is the name of the app, which you can find under the ‘Projects & Apps’ tab.

Once you execute the first line of code, you will be redirected to an authentication page in the browser window you have open, authorising the connection. You should only have to do this the first time you install and use the package.

Fetch Token

Now that you have created the token, it is time to fetch it for use in this session.

token <- get_tokens() # test authentication
token # print token, should list <oauth_app> ['Name of App']

This action will need to be taken every time you run the script. In a similar vein to the step above, when you request the token you may also be redirected to a browser window. The second line of code should list the <oauth_app> (app name) you decided on when creating the project at the start of this process.

The second step isn't a requirement, but I always find it useful to confirm that I'm calling the correct app, especially if I have multiple projects on the go at once.

Formatting Dates

This next function within the script is one of the trickiest. It automates the action of having to manually enter the dates between which you want to search for tweets.

s <- as.Date(Sys.Date() - 13, format = "%Y/%m/%d") # start date: x days ago (here, 13)
e <- as.Date(Sys.Date() - 11, format = "%Y/%m/%d") # end date: two days after the start (i.e. x - 2); adjust to suit your window

Essentially, you are creating two variables (‘s’ and ‘e’): ‘s’ refers to the start date and ‘e’ to the end date. It's important to note that both are selected relative to the current date; in other words, you are subtracting a number of days from today's date to get the dates on which you would like your request to start (and end).
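
For example (purely illustrative; the exact values depend on the day you run the script):

Sys.Date()      # e.g. "2020-11-14" (today)
Sys.Date() - 13 # "2020-11-01" -> assigned to s, the start of the window
Sys.Date() - 11 # "2020-11-03" -> assigned to e, the end of the window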

Notes on API Caps & Limitations

Generally, the caps on the number of tweets you can pull per sitting shouldn't be a problem. However, with this particular case study you can burn through credits without batting an eyelid. With that in mind, please note that you can request up to 500k tweets per month. This can change depending on which function you are using within the ‘rtweet’ package, but you should generally plan your requests around that figure as the maximum.
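
Separately from the monthly cap, the standard search endpoint is also rate-limited over short 15-minute windows. ‘rtweet’ includes a rate_limit() helper that I find useful as a sanity check before a big pull; a minimal sketch (the exact columns returned may vary between package versions):

# Check how many search requests remain in the current 15-minute window
search_limit <- rate_limit(token, query = "search/tweets")
search_limit # lists the limit, how many requests remain and when the window resets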

Request Twitter Data

At this final stage, you will now finally be requesting data. As mentioned previously, there are a number of functions the ‘rtweet’ package offers. For this example, ‘search_tweets’ will be used.

The other functions work in much the same way, so if you would like to explore the rest of the functionality the package offers, most of what you see in this tutorial should hold true.
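
For instance (an illustration of my own; ‘BBCNews’ is just an example handle), pulling a user's recent timeline follows the same pattern as ‘search_tweets’:

# Fetch up to 1,000 of the account's most recent tweets using the same token
bbc_timeline <- get_timeline("BBCNews", n = 1000, token = token)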

For this last section, the only arguments to focus on are:

q = # the tweet query
n = # the number of tweets to request
type = # the type of results to return (here, ‘recent’)
geocode = # the geo-specific location of the area
retryonratelimit = # retry the request if it hits a rate limit
since = # the start of the date window (the ‘s’ variable above)
until = # the end of the date window (the ‘e’ variable above)

I have commented in explanations of the important elements of the ‘search_tweets’ function above. To elaborate on a couple of points: to find geocode information there are a number of great resources; the one that I primarily use is here. Requesting ‘verbose’ output simply means that you want to see more detail about the request as it runs. Per the documentation, “a verbose connection provides much more information about the flow of information between the client and server.”

For the below example, I will be searching for tweets in regions of the UK (without a query). I have capped the n(umber) of tweets at 1M for the first request and 100k for the rest, which you may want to change in line with the cap mentioned above. The function will only search within the dates you have selected, and should therefore rarely reach the cap. That said, despite the shortness of the start-to-end window, there are times when requests surpass this limit; if you think this may be the case, I would recommend changing ‘n=’ to something more appropriate.

leicester_tweets<- search_tweets(q= "", n=1000000, type = "recent", geocode = "52.6442,-1.14862,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
bristol_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4489,-2.62233,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
peterborough_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.6268,-0.0450012,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
south_west_london_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4641,-0.167499,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
edinburgh_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "55.9265,-3.22776,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
liverpool_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.4159,-2.96287,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
north_london_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.5853,-0.114121,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
portsmouth_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "50.8021,-1.03495,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose = TRUE, since = s, until = e)
coventry_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.367,-1.50119,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
brighton_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "50.8289,-0.119607,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
reading_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "51.4046,-1.02755,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
norwich_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.6461,1.34025,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
derby_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "52.9331,-1.49156,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
birmingham_tweets <- search_tweets(q= "", n=100000, type = "recent", geocode = "52.4651,-1.88866,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
manchester_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.4787,-2.27173,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
wigan_tweets<- search_tweets(q= "", n=100000, type = "recent", geocode = "53.5344,-2.63743,5mi",parse = TRUE, token = token, retryonratelimit = TRUE,verbose =TRUE, since = s, until = e)
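
Before moving on to the conclusion, a quick note on what to do with all of these per-city objects. This isn't covered in the steps above, but a natural next step is to combine them into a single table and save a copy into the working directory set earlier. A minimal sketch of my own using dplyr (loaded above); add the remaining cities in the same way:

# Combine the individual pulls into one data frame, tagging each row with its city
all_tweets <- dplyr::bind_rows(
  list(leicester = leicester_tweets,
       bristol = bristol_tweets,
       manchester = manchester_tweets), # ...and so on for the other regions
  .id = "city")

saveRDS(all_tweets, "uk_tweets.rds") # saved into the working directory for later analysis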

Twitter API Conclusion

As you can see, within a matter of around ten steps, you can start to pull data via the Twitter API. If you would like to explore Twitter’s API further, I would recommend reading through the rtweet documentation, as it contains tons of useful information and other potential uses for the API.

This article is part of The Engine Room, the section of the blog that brings together programming and other technically demanding material in one place.

Harry Austen is a Data & Search Analyst. He has worked with the likes of Disney, The Olympics and Zoopla. @austenharry  
