Download Top 1 Million Sites

Jul 20, 2020 | Security - Internet, WordPress, and otherwise




Data sets of the top 1 million Internet sites are simply compiled lists of the web sites (or domains) found to have the most traffic. What follows are some of the most popular and well-known data sets of the Top 1 Million Sites.

Depending on the methodology used, the results can vary significantly. However, a reasonably accurate list is valuable for the many use cases these lists can be applied to.


Alexa Top 1 Million

https://www.alexa.com/topsites

Established way back in 1996, Alexa had a popular toolbar add-on for web browsers. Using the data collected by the toolbar, Alexa developed a top sites list and made it available via a web application.

The Alexa list, while primarily aimed at marketers, was used for many research projects because it was reasonably accurate and easily accessible, and it became the most well-known resource.

Alexa also offered a Top 1 Million list in CSV format that could be downloaded for free. This was an excellent resource and it found many use cases.

Alexa is now owned by Amazon, which has recently restricted access to the top 1 million list to paying customers. For a time a list was available at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, but it appears to be incomplete and is no longer updated.
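If you still have a copy of one of these archives, reading it takes only a few lines of Python. The sketch below assumes the zip holds a single CSV of headerless `rank,domain` rows; the function name `read_top_sites` and the local filename in the usage note are illustrative, not part of any official tooling.

```python
import csv
import io
import zipfile


def read_top_sites(zip_source, limit=None):
    """Return (rank, domain) tuples from a zipped top-sites CSV.

    zip_source may be a file path or a binary file-like object.
    Assumes the first file in the archive holds headerless `rank,domain` rows.
    """
    sites = []
    with zipfile.ZipFile(zip_source) as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                sites.append((int(row[0]), row[1].strip().lower()))
                if limit is not None and len(sites) >= limit:
                    break
    return sites
```

Calling `read_top_sites("top-1m.csv.zip", limit=10)` on a local copy would return the first ten rows as `(rank, domain)` pairs.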

Cisco Umbrella

http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

The Cisco Umbrella list is quite different. It still covers the top 1 million most popular sites, but it is put together from Cisco’s visibility into DNS traffic. Rather than measuring which sites are browsed most, it captures which host names are resolved most often in DNS.

Because it is based on popular DNS requests, the list contains domains that are not in the Alexa list, such as subdomains of primary sites that host other web resources (js / css / images) and even tracking domains used by analytics packages.
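One way to see this effect is to split a DNS-based list into entries that match a site list exactly, entries that are subdomains of a listed site, and everything else. This is a minimal sketch: `parent_site` and `split_by_known_sites` are hypothetical helpers, and the host names are placeholders rather than real list data.

```python
def parent_site(hostname, known_sites):
    """Return the closest entry in known_sites that hostname equals or sits under, else None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the full name first, then drop the leftmost label each time.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in known_sites:
            return candidate
    return None


def split_by_known_sites(hostnames, known_sites):
    """Partition hostnames into exact matches, subdomains of a known site, and unknown names."""
    exact, subdomains, unknown = [], [], []
    for name in hostnames:
        parent = parent_site(name, known_sites)
        if parent is None:
            unknown.append(name)
        elif parent == name.lower().rstrip("."):
            exact.append(name)
        else:
            subdomains.append(name)
    return exact, subdomains, unknown


known = {"example.com", "example.org"}
dns_names = ["example.com", "img.example.org", "tracker.example.net"]
exact, subdomains, unknown = split_by_known_sites(dns_names, known)
```

Here `img.example.org` lands in `subdomains` and `tracker.example.net` in `unknown`, which are the two kinds of entries a DNS-based list adds on top of a browsing-based one.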

The use cases for this list tend towards security and network monitoring. The security use case is not surprising given that Cisco maintains and compiles the list.

Although the data source is quite different from Alexa’s, we believe it’s arguably more accurate as it’s not based only on HTTP requests from users with browser add-ons. The way the ranking is computed is not as simple as the net sum of all DNS queries.

Majestic Million

https://majestic.com/reports/majestic-million

Majestic publishes a daily list compiled from analysis of web crawls. Sites are ranked based on backlinks, a methodology similar to that used by search engines.

Majestic’s primary use case is marketing and SEO.

Quantcast

https://www.quantcast.com/

Aimed at marketers, the data is based on traffic from “Internet Service Providers and Toolbar Providers”. For this reason the data covers only US-based traffic, and updates are provided monthly.

In the past this was a free resource, but it now requires an account.

Tranco-List.eu

https://tranco-list.eu/

A recently minted list, this free-to-download list uses a methodology that combines some of the other top 1 million site lists mentioned above. By combining lists, its authors believe they have a more accurate result, and they have even written a paper to explain it.
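The paper describes the actual procedure. To illustrate the general idea of merging several rankings, here is a minimal sketch that uses a reciprocal-rank (Dowdall-style) score. `combine_rankings` is a hypothetical helper, not Tranco's code, and the domains are placeholders.

```python
from collections import defaultdict


def combine_rankings(rankings):
    """Merge several ranked lists of domains into one list using reciprocal-rank scores.

    rankings: iterable of lists, each ordered from most to least popular.
    A domain at position r (1-based) in a list earns 1/r; scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for position, domain in enumerate(ranked, start=1):
            scores[domain] += 1.0 / position
    # Highest score first; ties broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


list_a = ["a.example", "b.example", "c.example"]
list_b = ["b.example", "a.example", "d.example"]
list_c = ["b.example", "c.example", "a.example"]
combined = combine_rankings([list_a, list_b, list_c])
```

With these inputs `b.example` comes out on top because it ranks first in two of the three lists, while `d.example`, which appears only once and in last place, ends up at the bottom.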

The list was created by academic security researchers, with KU Leuven among the institutions involved. An interesting article published on ripe.net compares Alexa, Cisco Umbrella, Majestic and Quantcast.

As the graphic from that article clearly shows, there is very little similarity between the different lists.

ripe.net study showing list similarity
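If you want to check the overlap between two lists yourself, one common measure is the Jaccard similarity of their top N entries. The sketch below is an assumption about how such a comparison could be done, not necessarily the method used in the ripe.net study; `top_n_overlap` is a hypothetical helper.

```python
def top_n_overlap(list_a, list_b, n):
    """Jaccard similarity of the top n entries of two ranked lists.

    Returns 0.0 when the two sets are disjoint and 1.0 when they are identical.
    """
    top_a, top_b = set(list_a[:n]), set(list_b[:n])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


first = ["a.example", "b.example", "c.example", "d.example"]
second = ["b.example", "a.example", "x.example", "y.example"]
overlap_top2 = top_n_overlap(first, second, 2)
overlap_top4 = top_n_overlap(first, second, 4)
```

The measure ignores the order within the top N, so two lists that contain the same sites in a different order still count as identical.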

SimilarWeb

https://www.similarweb.com/top-websites/

Another marketing-focused site that offers data. Only the top 50 sites are available unless you upgrade to a paid plan.

Moz

https://moz.com/top-500/download/?table=top500Domains

Moz is a search engine optimization (SEO) service with a large set of search-related data. Using this, they make the top 500 sites available for free.

Netcraft

https://trends.netcraft.com/topsites

Established in 1995, Netcraft is another company that has been around since the early days of the Internet. Internet data analysis and security are its core functions. They have extensive data on web hosting across the Internet going back to 1995.

Some of the work performed by Netcraft results in the take down of phishing sites and other cybercrime related measures.

DomCop

https://www.domcop.com/

Using data from CommonCrawl and CommonSearch, the DomCop project has compiled a list of the top 10 million sites. Better yet, the full site list is available as a free download.

CommonCrawl

https://www.commoncrawl.org/

While not an easily downloadable list, it is a resource of web sites that can be downloaded for free. This excellent project builds an archive (snapshot) of the web every 2 months. All the page metadata, along with HTML, HTTP headers and other information, is stored in an archive on Amazon S3. Completely free to download, it is a massive resource: the latest full archive came in at 53TB compressed! Have fun!
