Clustering: Backlink Attribution Model
This piece reflects on a project in which I built a model that reviewed correlations between our outbound link profile and weighted metrics known to correlate with better search results. The link attribution was later applied to forecast several campaigns and initiatives.
The reason for building an intricate model, one that theoretically attempts to figure out how Google ranks based on links, starts with Ahrefs. And those damn links. When you are working for a small site with a moderate level of link engagement, Ahrefs, SEMRush and other web scrapers are incredibly valuable as a measure of current backlink performance. When you work for a client with over 20M links, it’s a different story.
The reality of Ahrefs’ crawl quality and coverage, or lack thereof, meant constant discrepancies in our link data. It’s worth noting that this isn’t entirely Ahrefs’ fault. I can’t imagine the amount of planning that goes into a daily crawl of links on their end, and for this client in particular, links are a bit of an anomaly to start with. With the metric carrying a reasonably large chunk of importance, accurate reporting is critical to understanding and forecasting SEO long-term. There have also been reports of Ahrefs fudging the numbers for analytics. Again, this is not a true reflection of their output, as they offer very good SEO services. But it’s always a good idea to take their numbers, and those of other services for that matter, with a grain of salt.
In a task that I wouldn’t recommend to anyone, I went through our entire LRD (outbound link profile) backlog to manually spot issues ahead of running the model. Going one-by-one through just over 4k LRDs was a delicate experience. The silver lining of completing such a mammoth task was that a) I got a first-hand look at how egregious the links Ahrefs collected were, and b) I completed about four years’ worth of disavowing in one week. The web and IP disavows then became our exclusions. Due to the size of the dataset, and the fact that we wanted to target both web addresses and IP addresses, we were forced to make these exclusions in three waves. In other words, there were three stages at which we had a chance to add further exclusions: in R itself, in BigQuery and (eventually) in DataStudio when we visualised the results. We wanted to be very cautious with our pull limits (the amount of data we wanted to request), because 20M ain’t cheap, and this was a good way to limit the damage. No-follow links were also excluded from our list, seeing as they didn’t fit our SEO-focused criteria.
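To make the exclusion step a little more concrete, here is a minimal Python sketch of that kind of filter. It is an illustration only, not the project's code (which lived in R and BigQuery): the `Link` fields, the `apply_exclusions` helper and the sample domains and IPs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    # Hypothetical fields; a real export would carry many more columns.
    source_domain: str
    source_ip: str
    nofollow: bool


def apply_exclusions(links, disavowed_domains, disavowed_ips):
    """Keep only followed links whose domain and IP are not disavowed."""
    return [
        link
        for link in links
        if not link.nofollow
        and link.source_domain not in disavowed_domains
        and link.source_ip not in disavowed_ips
    ]


# Made-up sample data using reserved documentation domains and IP ranges.
links = [
    Link("example.com", "203.0.113.10", False),
    Link("spam.example", "203.0.113.20", False),   # disavowed domain
    Link("blog.example", "198.51.100.7", True),    # no-follow link
    Link("news.example", "198.51.100.99", False),  # disavowed IP
]

kept = apply_exclusions(
    links,
    disavowed_domains={"spam.example"},
    disavowed_ips={"198.51.100.99"},
)
```

Running the same filter at more than one stage is cheap, which is roughly why applying exclusions in waves was workable: each wave only needs an updated pair of disavow sets.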
Once we had a cleaned-up set of links, it was time to run them through the model to test our Complex Hypothesis. The list of theories to test included both SEO myths and truths: the strength of the likes of Domain Authority, Page Authority, LTD, number of total external links, language, and everything else under the SEO sun. After consideration, we decided the simplest way to express a link’s ‘true value’ was a score out of ten.
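As a rough illustration of what a score out of ten can look like, the sketch below takes a weighted average of metrics on a 0-100 scale and rescales it to 0-10. This is an assumption about the general shape of such a score, not the actual model: the `link_score` function, the metric names and the weights are hypothetical.

```python
def link_score(metrics, weights):
    """Weighted average of 0-100 metrics, rescaled to a 0-10 score."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(weights[name] * metrics[name] for name in weights)
    # Dividing by the total weight gives a 0-100 average; /10 maps it to 0-10.
    return round(weighted_sum / total_weight / 10, 2)


# Hypothetical weights: Domain Authority counts three times as much as
# Page Authority. The real weighting was tuned against data.
weights = {"domain_authority": 3, "page_authority": 1}
metrics = {"domain_authority": 80, "page_authority": 60}

score = link_score(metrics, weights)
```

Metrics that are not naturally on a 0-100 scale, such as a raw count of external links, would need normalising before they could be mixed in this way.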
As well as weighting metrics that are known to correlate with better search results, we also added damping factors such as the aforementioned no-follows and freshness (date published). These damping factors act as coefficients in the model. In technical terms, for the statisticians reading, the model drew on the ‘Coefficient of Variation’, also known as the ‘Relative Standard Deviation’: the standard deviation divided by the mean. There are several excellent guides, but I’ve linked the ones (above) that I found useful when explaining the concept.
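For readers who prefer code to terminology, here is a small Python sketch of both ideas. The damping function is an assumption about how such factors could be applied (a half-life decay for freshness and a flat multiplier for no-follows); the half-life of 365 days, the `nofollow_factor` and the function names are hypothetical and not taken from the actual model. The coefficient of variation is the standard definition.

```python
from datetime import date
from statistics import mean, pstdev


def damped_score(base_score, published, today, nofollow=False,
                 half_life_days=365, nofollow_factor=0.0):
    """Scale a base score down for stale or no-follow links."""
    age_days = max((today - published).days, 0)
    # Freshness halves every `half_life_days` (an assumed decay shape).
    freshness = 0.5 ** (age_days / half_life_days)
    follow = nofollow_factor if nofollow else 1.0
    return base_score * freshness * follow


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# A link scored 8.0 that was published exactly 365 days before "today".
one_year_old = damped_score(8.0, date(2023, 1, 1), date(2024, 1, 1))

# Spread of a made-up metric across eight links, relative to its mean.
spread = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Because the coefficient of variation is unitless, it lets you compare how much different metrics vary even when they sit on very different scales.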
Once we ran the links through the model (which I can’t share for reasons you already know), we started to see results in the form of signals. For example, we initially hypothesised that Domain Rating was the largest factor and weighted it accordingly. However, the results didn’t bear that out, and the same was true of our predictions for several other metrics. So we adjusted the weighting, again and again, until we reached a clear conclusion and, according to our data at least, a clear idea of which ranking factors carry the most authority when it comes to links. Once standardised, the model was built into our weekly reporting moving forward.
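Since the model itself can't be shared, the following is only a generic Python sketch of one way a reweighting pass might work: measure how strongly each metric correlates with an outcome, then set the weights in proportion to those correlations. The use of Pearson correlation, the `reweight` helper, the `outcome` column and all of the numbers are assumptions for illustration, not the method or data from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (that would divide by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def reweight(metric_columns, outcome):
    """Weights proportional to each metric's absolute correlation."""
    strengths = {
        name: abs(pearson(values, outcome))
        for name, values in metric_columns.items()
    }
    total = sum(strengths.values())
    return {name: strength / total for name, strength in strengths.items()}


# Made-up data: four links, two metrics, and a hypothetical outcome
# (for example, some measure of search performance).
metric_columns = {
    "domain_rating": [1, 3, 2, 4],
    "page_authority": [2, 4, 6, 8],
}
outcome = [1, 2, 3, 4]

weights = reweight(metric_columns, outcome)
```

In this toy data the metric that was assumed to matter most ends up with the smaller weight, which mirrors the kind of surprise described above. Correlation is not causation, so weights derived this way are best treated as a starting point for the next round of adjustment.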