- Common Crawl is a large, open web dataset that was historically used in early LLM training.
- According to Metehan, blocking Common Crawl may not remove a site’s link influence, because inbound links still affect its metrics.
- Harmonic centrality and PageRank reflect authority and visibility in Common Crawl's web graph, but should be used only for benchmarking, not as direct AI ranking signals.
- Relevant mentions from authoritative publishers remain one of the strongest drivers of AI visibility, according to Metehan.
There is a non-profit organization that has been crawling and archiving the internet for years. No, it's not the Internet Archive; it's called Common Crawl.
Metehan Yesilyurt, Chief Growth Officer at AEO Vision, recently published a study examining how Common Crawl may influence LLMs. (Our study found that about 75% of news sites block Common Crawl.)
In this podcast, we discuss what Common Crawl is, how it works, Metehan’s new tool for finding where your site ranks in Common Crawl’s WebGraph, and much more.
If you are interested in AI and AI citations, this episode is a must.

Here is the transcription:
What do we know about how AI gets its training data?
Metehan Yesilyurt
Let's start with the AI training part. It was a big question for me for months, almost more than a year.
As we know, ChatGPT and the other LLMs, alongside Perplexity, brought web search almost a year ago, in February, and now we use it for almost everything: daily questions, research, academic work, etc.
But even so, some answers are still not grounded in live search; they come from the model's own backend.
These systems have a very complex infrastructure at the moment. The AI training part, what I call offline training data, was always interesting to me, so I started working on it. In the first step I found nothing, actually.
I was very upset, but that's the nature of our business.
We need to push hard. So I continued my research, and then I realized something everybody knows: all LLMs, all AI models, need data.
So they need to find open data somewhere. I already knew Common Crawl, and they have very big public open web data. It's huge; it contains petabytes of data.
Then I realized Common Crawl also has other datasets, like the Web Graph.
They also have a host index.
Thanks to Greg, the great CTO of Common Crawl.
They also have a great team right now; I met with them after my post and tool came out.
The actual question started with the AI training data, of course. And then we see some domains more often in the responses.
So I believe there should be some reasons behind it,
whether it's crawling data from the open web or open-source data from Hugging Face, GitHub, or other platforms.
It all started with this, and I realized there is domain-level data, including host details.
It was very interesting and very big data. I tried to manage it, and I believe I did. So it all started with this question: what do we know about AI training and LLM sources?
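If you want to poke at that domain-level data yourself, Common Crawl exposes a public index you can query over HTTP. Here is a minimal sketch in Python, assuming the requests library; the crawl ID below is only an example, and the current list of crawls is published at index.commoncrawl.org:

```python
import requests

# Query Common Crawl's public CDX index for captures of one domain.
# CRAWL_ID is an example; pick a real crawl ID from https://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2024-10"
API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

resp = requests.get(
    API,
    params={"url": "example.com/*", "output": "json", "limit": "5"},
    timeout=30,
)
resp.raise_for_status()

# The endpoint streams one JSON record per line, one per captured URL.
for line in resp.text.splitlines():
    print(line)
```

If a domain has blocked CCBot, queries like this against recent crawls should come back empty, which is one quick way to verify that an opt-out actually took effect.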
What is Common Crawl?
Metehan Yesilyurt
Common Crawl is a non-profit organization and foundation that crawls the whole web.
It respects rules like robots.txt, so publishers and even independent websites can opt out of the system. It has a crawling system like Google, Bing, and other crawlers.
And we also know Meta has crawlers, as do other platforms, third-party SEO tools, etc. So Common Crawl is a non-profit organization that keeps this crawl data of the open web. Amazon sponsors the hosting at the moment, so you can access their data publicly, but as we mentioned, it's tons of data right now.
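For reference, the opt-out works through the standard robots.txt mechanism. CCBot is the user-agent token Common Crawl documents for its crawler, so a site-wide block looks like this:

```
# robots.txt: tell Common Crawl's crawler to skip the entire site
User-agent: CCBot
Disallow: /
```

Note that this only stops future crawling; as discussed later in the episode, it does not remove the links that other sites point at you.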
Vince
Yeah, and to elaborate on that: we did a study on news publishers that are blocking AI crawlers, and we found CCBot, which is Common Crawl's crawler, was blocked by 75% of the top 100 news sites in the US and UK.
What platforms do we know use Common Crawl as part of their training data?
Metehan Yesilyurt
Yes, we have some public records from more than two years ago, I guess from 2023 and early 2024.
We know some early GPT models were trained on Common Crawl data, and Google has other datasets processed from Common Crawl.
So yes, Common Crawl is storing crawl data.
That's correct. They're also capturing the web layouts.
And we didn't know Common Crawl well in the search industry, but every major publisher already knows it and can identify it in their server logs.
And almost every major publisher in the USA and the United Kingdom has started to block CCBot at the moment. It's a debate, actually.
I won't get involved in that debate here; I'll stay on topic.
Besides the page data, they are also crawling link connections,
what we would call backlinks in traditional SEO.
So they store backlink data, and they created two metrics from it.
The first one is harmonic centrality, and the second is PageRank.
I believe they borrowed that very familiar word from Google, from almost two decades ago.
So this is the current situation.
Big publishers are blocking CCBot.
We know there's open web data, they crawl the web on a regular basis, and they release this data.
It's actually very useful if you want to build your own large language model, for yourself or for your company.
So there's a huge dataset. And also, Reddit is blocking CCBot.
And we know Reddit has separate agreements with Microsoft, Google, OpenAI, etc.
So there seem to be mutual agreements with the big players in the market at the moment.
This is what Common Crawl does at the moment. They have different datasets. The only problem is that there is no real tool to check your website. Of course, Common Crawl has some basic tools, but it's obvious we need more.
Vince
So, on the platforms that are using Common Crawl: as you said, a few years back, I think it was Gemini and ChatGPT, they actually published papers saying they had used Common Crawl data in very early models, like the early GPTs.
So we don't know to what extent they're using it now, but we know they are probably using it.
Metehan Yesilyurt
Who knows? This is the biggest dilemma at the moment.
Personally, I believe whoever has the biggest data in the world can also be the biggest player in the market and rule everyone.
So everyone keeps their data sources secret at the moment, and we don't know if they are using Common Crawl or not.
Can you briefly explain Common Crawl's Harmonic Centrality and PageRank?
Metehan Yesilyurt
Yes. They can also identify some spam signals, I guess.
They mention it's not as advanced as Google's, but let me try to explain. The PageRank side seems similar to Google's.
The harmonic centrality metric, though, calculates how your website is positioned within their dataset.
If you want to be near the center, with a high harmonic centrality rank, you actually need to be a very big player.
So let's say you're a well-known founder running your own website or personal blog: you can get a high PageRank.
But if your website isn't
widely linked by others, your harmonic centrality can be lower than your PageRank. So it also shows your website's position in the whole crawl dataset. And the Common Crawl team are basically telling us they didn't crawl the entire web, every web page in the world.
Just like with Google, there are indexing gaps, you know, and millions of pages go live every minute.
So these are the metrics: one shows your link strength, and one shows your website's position in the crawl data.
Vince
The idea being the higher, the closer you are to the center, the better. Like that means the more links you have, essentially.
Metehan Yesilyurt
Yes, it means you appear more authoritative.
Think of the big major publications in the United States, in New York or in London.
We know many great brands are very close to the center, and they have high PageRank and harmonic centrality scores.
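To make the two metrics concrete, here is a small sketch using the open-source networkx library on a toy link graph. This is only an illustration of the two ideas; Common Crawl computes them over hundreds of millions of hosts with its own pipeline, and the node names here are made up:

```python
import networkx as nx

# A toy directed link graph: an edge u -> v means "site u links to site v".
G = nx.DiGraph()
G.add_edges_from([
    ("blogA", "hub"),
    ("blogB", "hub"),
    ("blogC", "hub"),          # "hub" is linked from many places
    ("hub", "founder-blog"),   # "founder-blog" has one strong inbound link
    ("blogA", "blogB"),
])

# PageRank: a single link from an important page can carry a lot of weight.
pagerank = nx.pagerank(G)

# Harmonic centrality: the sum of 1/distance from every other node, so it
# rewards sites that are a short hop away from the rest of the graph.
harmonic = nx.harmonic_centrality(G)

for node in G.nodes:
    print(f"{node:12s}  pagerank={pagerank[node]:.3f}  harmonic={harmonic[node]:.2f}")
```

On this toy graph, founder-blog inherits a strong PageRank from its single authoritative link, while its harmonic centrality trails hub's because most of the graph sits two hops away, which is exactly the divergence described above.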
How do these two metrics correlate with what AI systems know and what they cite?
Metehan Yesilyurt (14:55)
Yes, okay. Here is the great question, because Common Crawl is also saving host-level data.
Their crawls also pick up CDN servers, other third-party software platforms, et cetera.
So we also see those domains, top-level domains, and subdomains in this data, and they have very high scores at the moment, but they are not useful for AI training.
So there is a need to clean this data and make it ready for the training phase. We also know that large language models go through pre-training and post-training phases, et cetera.
So there's a multi-layer configuration, actually.
And we don't know whether they are using Common Crawl datasets, and we don't know its exact position in this configuration or layered system.
So there’s another dilemma, but this is a great question.
I can say we don't know exactly what it is or how they use it.
We do know that if you run successful PR campaigns worldwide, let's say, you can actually increase these rankings and metrics in the next dataset they publish, say next month.
We don't know exactly how your brand gets cited more in AI responses,
and we don't know the exact timeline, but you can see your progress in these metrics. And I believe this is very important: I don't treat these metrics as a direct ranking system or direct ranking signals for AI citations or responses.
We are also living in a prompt-tracking world right now.
But I believe this is an indirect signal, and you can actually use it for competitor benchmarking.
Do you think link relevancy plays a part in this?
Metehan Yesilyurt
Yes, I believe so, based on my experience and experiments. And from an SEO perspective: I love saying SEO, by the way, and I respect and also use AEO or LLMO. We can call it whatever we want, but GEO is the most popular at the moment.
That's all fine, and I respect every opinion. So what can I say? Yes, relevance is the biggest weapon for AI search at the moment, because we know how OpenAI and the other LLMs work.
Gemini is a little bit different, because Google still has its own system for ranking, crawling, layout parsing, etc.
That's another topic.
So, relevance seems to be a core layer of this optimization process.
Absolutely, you need to get relevant backlinks and relevant mentions from other websites.
This is the key, because we are now living in a semantic world, and they also check connections. I believe context graphs will be the new hype this year.
We'll see.
That's a prediction.
And thanks to Andrea Volpini for bringing this topic into the AI search world. Thanks, Andrea.
So yes, this is the current situation.
Relevance matters, and if you want to increase your harmonic centrality ranking, it seems you still need to be mentioned by some big publishers.
We don't know the exact formula behind these metrics, but it seems so.
Yes, relevance is key, but sometimes getting links from the big players is also fine.
If 75% of news sites block Common Crawl, are we chasing a ghost?
Metehan Yesilyurt
Yes.
I believe in the near future most of the big players will reach agreements with these LLM companies, including OpenAI, Anthropic, and Google Gemini; Google already has agreements with very big players in the market.
So it's coming very soon.
As for the Common Crawl data: yes, you can block CCBot.
Yes, you can opt out of the crawl, but other sites' links still point at your opted-out domain.
So you will still have these metrics, and I believe your harmonic centrality score will still be affected.
But we know some big news websites are very effective for AI search, for getting your brand cited. So it's still effective. Yes, there is a small gap at the moment:
25% of those sites still allow CCBot, but we'll see what happens in the future.
Yes, rankings will be affected. And whether the big platforms will license their text or images is a debated topic, and another topic entirely, but this is the current situation.
Do you think publishers who don’t partner up with AI will get left behind?
Metehan Yesilyurt (23:58)
Yes, at some point. But also, let's change direction for a moment.
There's a non-profit organization that has already been crawling the web for more than a decade, and they are really talented with their tech stack.
They can crawl fast, they can process the data, and they can store very large amounts of it.
And on the other hand, big large language models need very big data.
So this is the picture. I can't say they're using Common Crawl data, because I don't know.
There are some speculations, of course, especially after some lawsuits in the US between big companies; you know the recent one involving Google.
Because of this chaotic environment, I believe right now everyone keeps their data sources in a very secret lab.
Also, I can never say Common Crawl is the first-priority need for large language models, because there are other data platforms and open public web data elsewhere, but this is the situation.
So I can never say they're using it directly, or that these link metrics affect citations within a second, let's say. But if anyone needs link data, there are open public link metrics on the internet.
That's what I'm trying to tell everyone.
So please do not use these metrics as direct ranking signals.
Again, PR works like a charm.
So find the best tool in the market for your outreach efforts and start using it.
What didn’t I ask you about that you want people to know?
Metehan Yesilyurt
Yes. First, on the technical side, I will be honest: I'm having some performance issues with my tool, and there have been some great suggestions from my friends in the market, thanks to them. I'm hosting it on Cloudflare, so if anyone from Cloudflare is watching, please reach out to me.
Second, new Web Graph data is hopefully dropping this week, and I'm planning some performance upgrades along with adding more data.
There are currently more than 150 million domains in the overall domain-level dataset, and I'm trying to provide at least the top 10 million at the moment, but we'll see.
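For anyone who wants to try this themselves before such tools mature, the web graph releases include plain-text rank dumps you can scan directly. A hedged sketch, not Metehan's tool: the file name and column layout below are assumptions, so check the header line of the actual release file, which is linked from Common Crawl's web graph announcements:

```python
import gzip

# Scan a (locally downloaded) Common Crawl domain-ranks dump for one domain.
# RANKS_FILE is a hypothetical local copy; real files are listed in the
# web graph release notes. Domains are stored reversed: example.com -> com.example
RANKS_FILE = "cc-main-domain-ranks.txt.gz"
TARGET = "com.example"

with gzip.open(RANKS_FILE, "rt", encoding="utf-8") as fh:
    header = fh.readline().lstrip("#").split()  # column names from the file itself
    for line in fh:
        fields = line.split()
        if fields[-1] == TARGET:  # assumes the reversed domain is the last column
            print(dict(zip(header, fields)))  # harmonic and PageRank positions
            break
```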
I'm also expecting many other great companies to build benchmarking tools on this data. In the last six months this was a very hot topic among C-level managers, executive directors, shareholders, and stakeholders: how do these AI systems know about our brand without searching anything online?
That was the hot topic, and I want to point to some data sources where you can maybe find your answers. We'll see. This year started in rocket mode, you know: there are many new video tools, tools you can drive from your command line, new bots you can set up on your Apple Mac Mini, and so on. This year will be huge for us.
And please, please update your content. Content freshness is a very big boost multiplier at the moment, so use it effectively, and please increase your outreach efforts.
I will be honest, I don't say this just because I'm on a BuzzStream podcast, but please review your outreach efforts, because it works until it doesn't.
Yes, I mean for the listicles, for the best-of, comparison, and alternatives pages, et cetera.
If it's a short-term game, use it effectively and with logic. And consider your PR releases: talk with your PR team and review your whole new-year process, because we are in late January at the moment.
Of course, this is a great time to review your new-year plan with your marketing team and with your developer team.
I can also add these notes from my side, because these are the hot topics at the moment with literally every company I'm in contact with.
