As of 2021, there are an estimated 56.5 billion web pages indexed on Google. That is an unimaginably large number of web pages, and as you read this article, more are being added.
Billions of queries are made on Google every day, and each day new content in the form of articles, guides, videos, and photos is added. This giant, complex jungle of text and media seems like an impossible place to navigate. It’s like looking for a tiny piece of paper in a pile of paper that’s the size of a city.
So surely Google must take a considerable amount of time to scour through these billions of pages and present you with the right information? Not exactly. Take a look at this:
For a simple, general query of “Best Books to Read”, Google presented me with more than 4 billion search results within a second. This sounds like magic. And if you stop and think about how search engines work, even the simplest search query does feel like magic, though it is really a lot of algorithms and clever indexing.
This is the ultimate guide on everything you need to know about search engines, what they are, how they work, and how they are evolving and changing the information you see. Let’s begin.
What is a Search Engine?
Before we try to understand the complex process of how search engines operate, separate the millions of similar web pages, and present you with the most relevant and helpful ones, we need to understand what a search engine actually is.
Search engines are made of two major components: an index of information and a very complicated set of algorithms that contain the rules for searching that index. There are other, smaller components too, such as crawl bots and vitals indicators, that gather additional information, but they all work in service of these two.
The best-known search engine is Google, which is also the most popular search engine in the world. There is a good chance that you found this web page through a Google search. Google is so big that its name is used as a verb. Can’t find something? Just “Google” it. But Google is not alone in the search engine space. There’s Microsoft’s Bing, DuckDuckGo, China’s Baidu, Russia’s Yandex, Yahoo, and others.
What many people don’t know is that even the popular image-sharing site Pinterest is a type of search engine, and so is the popular forum Reddit. Any website that lets you search for information and presents it to you based on relevancy is a type of search engine.
How do Search Engines Work?
Before we get to the most complex and important part of this article, we’ll give you an example of how our own mind works and how it is very similar to search engines.
Whenever you see someone familiar, your brain does a very quick search of all the known faces you have ever seen. There is an index of all the recognized faces (and even similar faces) in your brain. Anytime you see someone, this quick search occurs and when your brain finds the perfect match, you recognize the person.
When you see someone for the first time, three major processes occur. First, you look at their face and their facial features. Then you store this information in your brain, which is remembering how they look.
Finally, you associate qualities with them, such as their name, relation, how you met the person, and so on. Search engines such as Google work in a very similar way.
The Three Big Processes of Search Engines
Google and most of the other search engines follow the “three fundamental processes” of organizing and providing information. These three processes are Crawling, Indexing, and finally Ranking.
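These three processes do not all happen at the moment you search. Crawling and indexing run continuously in the background, long before a query is made. When a user searches for something, the engine only has to look up its prepared index and rank the matches, just like your brain matching a face it already knows. This is how Google was able to answer my query with 4+ billion results in a fraction of a second. Let’s take a look in detail at each of these three processes: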
1. Crawling
Crawling is a clever name for the process of exploring the internet, a reference to the internet being called the web. It is the process in which small programs called crawl bots (Google’s is called Googlebot), or more aptly “spiders”, scour the web, following the links on a website, going through its pages, and noting all the important factors for consideration.
Crawling is controlled entirely by the search engine, including all of its parameters and its frequency. This means that Google decides when and how often a website will be crawled and which pages will be crawled. These “spiders” go through a website, look at the different content, and then decide how often it must be crawled.
For example, static web pages with fixed content (such as privacy policy pages) need not be crawled every day, or even every week or month. So Googlebot visits those pages less often and focuses more on the dynamic content that keeps getting updated. Google discovers many pages on its own by following links from pages it already knows, but website owners do not have to wait for that to happen.
Website owners and SEOs can submit their website through Google’s Search Console to request that it be crawled. If a website is not crawled and indexed by Google or any other search engine, users won’t be able to see it on that engine’s results pages. So making sure a website can be crawled is the first prerequisite for ranking. But how do you submit a website? In the form of a sitemap.
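To make the idea of crawling more concrete, here is a minimal sketch in Python. It is not how Googlebot is implemented; it only illustrates the basic loop of visiting a page, extracting its links, and queuing the ones that have not been seen yet. The `FAKE_WEB` dictionary, its URLs, and the `crawl` function are made up for this example so that it runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" (URL -> HTML). A real crawler fetches pages over HTTP.
FAKE_WEB = {
    "https://example.com/": '<a href="/blog">Blog</a> <a href="/privacy">Privacy</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a> <a href="/">Home</a>',
    "https://example.com/blog/post-1": '<a href="/blog">Back</a>',
    "https://example.com/privacy": "<p>No links here.</p>",
}


def crawl(start_url, pages):
    """Breadth-first crawl: visit a page, extract links, queue the unseen ones."""
    seen = {start_url}
    queue = deque([start_url])
    visited_in_order = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # the page does not exist in our toy web
        visited_in_order.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links like "/blog"
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited_in_order


crawled = crawl("https://example.com/", FAKE_WEB)
for page_url in crawled:
    print(page_url)
```

Starting from the homepage, the sketch reaches every page that is linked from somewhere, which is also why a page with no inbound links and no sitemap entry can stay invisible to a crawler.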
2. Sitemaps
When Google sends its crawlers, the first thing they need to know is which pages exist on your website. Details such as the fonts your website uses or the color of your homepage do not help with that. What helps is a simplified list of your content links.
In other words, a directory through which they can easily access your entire website. This directory, typically an XML file that lists the URLs of a website’s pages, is called the sitemap. Take a look at the sitemap of a website in the image below.
This is the view of the website that crawlers start from. As you can see, a large website with so many different parts can be sorted and organized in this simple format, a directory to all of its important content.
So SEOs and website owners submit their websites in this format. Once it is submitted, the crawlers work through the entries of the sitemap and decide which parts should be crawled frequently and which rarely.
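As a rough illustration, the sketch below reads a tiny sitemap with Python’s standard library. The XML follows the public sitemaps.org format (`urlset`, `url`, `loc`, `lastmod`, `changefreq`), but the URLs, the dates, and the `parse_sitemap` helper are invented for this example. How a real search engine weighs these hints is not something the sketch tries to show.

```python
import xml.etree.ElementTree as ET

# A made-up sitemap in the sitemaps.org XML format.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2022-01-10</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/privacy</loc>
    <lastmod>2020-05-01</lastmod>
    <changefreq>yearly</changefreq>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its location and update hints."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append(
            {
                "loc": url.findtext("sm:loc", namespaces=NS),
                "lastmod": url.findtext("sm:lastmod", namespaces=NS),
                "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            }
        )
    return entries


entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry["loc"], "| last modified:", entry["lastmod"], "| changes:", entry["changefreq"])
```

The `changefreq` and `lastmod` values are only hints from the website owner. The crawler still decides for itself how often to come back.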
Robots.txt: Robots.txt is a small text file that a website places at its root, separate from the sitemap (although it can point to the sitemap’s location). Its directives are meant for the crawlers or spiders. Using robots.txt, we can tell them which parts of the site they may crawl and which parts they should skip. Note that you can only suggest this to the crawlers, not command them: well-behaved bots follow the rules, but compliance is voluntary.
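Here is a small sketch of how a polite crawler might consult robots.txt before fetching a page, using `urllib.robotparser` from Python’s standard library. The rules, the paths, and the bot name `ExampleBot` are invented for illustration, and `site_maps()` requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot may crawl everything except /admin/ and /cart/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/

Sitemap: https://example.com/sitemap.xml
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())  # parse the rules from text, no network needed


def allowed(url, user_agent="ExampleBot"):
    """True if the rules permit this (hypothetical) bot to fetch the URL."""
    return robots.can_fetch(user_agent, url)


print(allowed("https://example.com/blog/post-1"))
print(allowed("https://example.com/admin/users"))
print(robots.site_maps())  # sitemap locations declared in robots.txt
```

Nothing in this file can force a crawler to behave. The check only works because the crawler chooses to run it.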
So a search engine, including Google, builds up a huge list of URLs from these sitemaps and from the links its crawlers follow, and its algorithms work from the content found at those URLs. The bots look at the content of these URLs, how often they are updated, and then the next process begins.
3. Indexing
Indexing is the next process, where the content gathered by the crawlers is organized and stored in the search engine’s index, its own repository, along with the other parameters that tell the algorithm how to process these entries when a query is made. Google’s indexing system, introduced in 2010, is called Caffeine. This part of the process is far more important than you can imagine.
Indexing here does not just mean keeping these pages in large stacks to be accessed later. Indexing is a very complex process in which the massive amount of data from billions of web pages is sorted and arranged in such a way that accessing it becomes easier, faster, and more accurate. This is more like sorting books into different categories.
All the fiction books go to one shelf, where the different genres of fiction are kept on different racks. Further criteria are then used to sort these books into separate sections, so that when someone asks for the type of book they want, Google can access it quickly and accurately.
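A classic data structure behind this kind of sorting is the inverted index, which maps each word to the pages that contain it, much like the index at the back of a book. The sketch below is a toy version under simple assumptions (lowercase word matching, no stemming, three made-up pages). Google’s real index is far more elaborate and is not public.

```python
import re
from collections import defaultdict

# Made-up pages standing in for crawled content.
DOCS = {
    "https://example.com/fiction": "Best fiction books to read this year",
    "https://example.com/cooking": "Best cooking books for beginners",
    "https://example.com/cars": "Best car wash tips",
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(DOCS)
print(sorted(lookup(index, "best books")))
print(sorted(lookup(index, "car wash")))
```

Because the index is built ahead of time, answering a query only requires a few set lookups rather than rereading every page, which is the main reason results can come back in a fraction of a second.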
All the websites that one can find in Google exist in its large index, sorted in such a way that Google is able to access, correlate, and present them to users in a fraction of a second. So this was the second process. Now comes the most important and complicated part of the entire operating process of a search engine: ranking.
Google Ranking
While the other two processes are just as necessary for a search engine to operate, ranking is the most visible one. This is what differentiates the various search engines, and this is the process that affects users the most, along with web publishers and eCommerce website owners.
There is so much to ranking that no one can explain the entire process, not even the developers at Google. It is commonly said that the search engine algorithm takes more than 200 different factors into account when deciding the ranking. But what is ranking?
Remember the simple search I did at the beginning of this article that showed me over 4 billion search results in less than a second? The first page of Google shows only around 10 search results, so out of 4 billion results, what decides which web page comes in the prestigious top 10? This is what ranking is and it can make or break a search engine.
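To give a feel for what “deciding the top 10” means in code, here is a deliberately simple ranking sketch based on TF-IDF, a textbook relevance score that rewards pages where the query words appear often and gives extra weight to words that are rare across all pages. This is a stand-in for illustration only. Google’s actual ranking is not public and uses far more signals. The pages and the `rank` function are made up.

```python
import math
import re
from collections import Counter

# Made-up pages standing in for the index.
PAGES = {
    "https://example.com/reading-list": "best books to read best novels and books",
    "https://example.com/cooking": "books about cooking",
    "https://example.com/car-care": "how to wash a car",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, pages, top_k=10):
    """Score each page against the query with TF-IDF and return the best URLs."""
    tokenized = {url: tokenize(text) for url, text in pages.items()}
    n_pages = len(tokenized)

    # Document frequency: in how many pages does each word appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for url, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(words)  # how prominent the word is on this page
                idf = math.log(n_pages / doc_freq[term]) + 1.0  # rarer words weigh more
                score += tf * idf
        if score > 0:
            scores[url] = score

    # Highest score first, cut off at top_k (the "first page" of results).
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


results = rank("best books", PAGES)
print(results)
```

Pages that never mention the query words get no score and drop out entirely, while the remaining pages are ordered by how strongly they match.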
The reason why Google is the top search engine in the world is all because of its strict, advanced and impeccable ranking algorithm. Let’s try to understand how Google ranks, and why.
Why Does Google Ranking Matter?
Providing the most accurate and relevant results for search queries is the number one priority of all search engines. Each year, with each algorithm update, Google tries to refine the search result page.
There are billions of results for one search query, but surely someone has to do the quality control to provide the most relevant and helpful answer. This is what the algorithm part of the search engine does. This is why ranking matters so much for the users.
Everything that is present on the web has a very important identifying piece in it: a keyword. Keywords are the words users use to search for information. Going back to the previous example, my query for “Best books to read” has “best books” as the core keyword, while the “to read” portion is considered the semantics of the query.
Let’s get to know what these two are in detail.
Keywords And Semantics
Keywords are the first thing the search engine looks for when fetching the results. This is the guide for them to find the appropriate index to look for. So whenever you search for “study table” or “best car wash”, or something similar, the keyword is the most important part of the query.
Not just the query, but when content publishers create the content, they make sure that the keyword is used properly and clearly so that it is easier for the crawlers and the readers to understand the content in relation to the keyword.
But no search query is simple. Everything else in a query other than the keyword is its semantics. This is what carries the intent behind the query. In “best books to read”, I am looking for books that are worth reading, not just any page that mentions books.
A search result page showing the best books for cooking or the best science fiction books would be wrong. While both the results have keywords in them, the semantics are completely different.
Another important thing that Google’s algorithms do is understand the intent of the searcher, adding that to the webpage analysis and using intelligence, producing accurate results. In a way, it is simulating a human brain. If you search for “water jars”, Google understands that you might be looking to buy water jars rather than trying to know what these jars are. This is why you get shopping links in the first portion of the SERP. The same thing happens with articles. Google looks at the keywords, finds related keywords, and then provides the best article that matches the intent of the reader.
Google’s BERT update, rolled out in 2019, was all about understanding the context of the query. This helps Google differentiate between a question that is asking for a review of a restaurant and a question that is looking for directions to the same restaurant. This sounds very simple here, but creating algorithms that scan through so much information to find exactly what you are looking for is a very complex and difficult process.
So how does Google rank web pages? This is a question that many have asked and none could fully answer. And if anyone says that they do have the exact answer to this question, they are lying.
No one outside Google knows the exact ranking factors or how much each factor weighs when it comes to ranking. But we do know what some of these factors are and how they affect ranking, because Google has said so itself.
It would be very hard for Google to sort and arrange web pages unless they follow certain rules and guidelines. Following these guidelines is necessary if publishers want their content to be ranked. In common web lingo, we call this search engine optimization, or SEO.
A well-optimized website has more chances of getting ranked, along with great quality content that addresses the search and user intent. But that’s not all when it comes to ranking.
What does Google look for when it is ranking web pages? A lot. Here are some of the factors that affect the ranking of web pages and why they are so valuable.
Factors Affecting Google Ranking
Backlinks
Backlinks have always been one of the most powerful ranking factors, and they are likely to remain so. Backlinks are powerful because they show that your website has authority. A large number of external websites linking to your website signals that it contains valuable information.
Getting backlinks is one of the most difficult things to do, but even a single high-quality backlink can noticeably lift your rankings and give your website an influx of traffic.
Google looks at all the inbound links coming to your website. It also considers other factors about each link, such as what type of website is linking to you, the quality of the link, and the anchor text in which the link is embedded. Ideally, the website linking to you is authoritative, the link is a do-follow link (that is, it is not marked with rel="nofollow", so crawlers are free to follow it and pass credit to your website), and most importantly, it is relevant to the content.
For example, if two big sites are linking to one website for some keyword, this particular website will tend to rank higher in the SERPs than other, similar websites.
Unrelated, low-quality links with no relevance are not only discarded by Google, but websites sometimes get penalized for them. This is why getting the right backlinks for your website is more important than ever: as of 2022, Google continues to use backlinks as a ranking signal.
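The idea that links act as votes of authority was formalized in PageRank, the link-analysis algorithm published by Google’s founders in the late 1990s. The sketch below is a simplified version of it, shown only to illustrate why a page with more inbound links scores higher. It is not a description of how Google weighs backlinks today, and the site names and the link graph are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: two sites link to "target", only one links to "other".
LINKS = {
    "big-site-a": ["target"],
    "big-site-b": ["target"],
    "small-blog": ["other"],
    "target": [],
    "other": [],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "target" ends up with a higher score than "other" simply because more pages point to it. In the full algorithm, a link from a page that itself has a high score is worth more, which mirrors the point about authoritative websites above.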
Interlinking
Another very important ranking factor for a website is the amount and quality of internal linking. Interlinking is when an article contains links to other articles on the same website that are related to the current one.
For example, here’s a detailed guide to understanding what SEO is and how it is going to change in 2022. This interlinking of web pages creates a well-organized structure that is easy for crawlers to crawl and extract relevant data from. It also makes navigation easier and more useful for readers, which is why Google considers it a ranking factor.
The Freshness of The Content
No one wants to read an article or see something that was created 5 years ago. The world of information has a short expiry date, and people want the latest information in everything. Google understands this very well and that is why content freshness is one of the ranking factors. But what does content freshness mean?
Google’s crawlers look for updates in the content of a website. It is a no-brainer that a website whose content is frequently updated, edited, and added to shows that it is active. Compare this to a website that hardly ever updates its content.
Since these bots don’t know anything about the websites (other than the stripped-down HTML content), the only factor that shows that the website is active is frequent updates. Hence, the freshness of the content is a ranking factor.
For the readers, Google always tries to provide the latest information, the most recently updated one because as a reader you’d want to read the latest information too. No one wakes up in the morning and asks for last week’s newspaper because it is irrelevant.
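One simple way to picture a freshness signal is a score that decays with the age of the last update. The sketch below is purely hypothetical: Google has not published such a formula, and the `freshness_score` function and its 180-day half-life are assumptions chosen only to illustrate the idea.

```python
from datetime import date, timedelta


def freshness_score(last_updated, today, half_life_days=180):
    """Hypothetical freshness score between 0 and 1.

    The score is 1.0 for content updated today and halves every
    `half_life_days` days (an assumed value, not a known Google parameter).
    """
    age_days = max((today - last_updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


today = date(2022, 1, 1)
print(freshness_score(today, today))                         # updated today
print(freshness_score(today - timedelta(days=180), today))   # one half-life old
print(freshness_score(today - timedelta(days=5 * 365), today))  # about five years old
```

A score like this would typically be just one input among many, so an older page with excellent content could still outrank a fresh but shallow one.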
Make it “Mobile”
Google is a mobile-first indexing search engine. What this means is that Google takes the mobile version of a website’s page and uses it for indexing and ranking, rather than the desktop version.
Many people and even some SEOs confuse it as a practice that favors mobile devices, but that is not the case. Mobile-friendliness here means that the website is not a static mess of information like the websites from the early 2000s. No one wants to visit those websites, and Google knows that.
Google cares about responsive websites that can change their dimensions and interactivity when accessed via different devices. It is no shocker that most website traffic comes from mobile devices such as smartphones and tablets. So creating a website that opens smoothly on both mobile devices and desktops is crucial to getting any ranking.
Page Ranking Factors
This is the section that holds over 200 other ranking factors, some known, some unknown. This entire section is so dense with information that it needs a separate article to look at it in depth. We’ll try to keep it as concise as possible. But if you want to dig deeper, you can check out our article on understanding SEO, where we have discussed the page ranking factors in depth.
There are many things that come under this ranking criterion, some of which were introduced as recently as 2021! Google’s aim with these factors is to provide only the best content to users in the best way possible. So what does this entail?
The most important factors here include page load metrics: how long it takes for the first piece of content to appear on screen (First Contentful Paint, FCP), how long the largest piece of content takes to load (Largest Contentful Paint, LCP), and how much the page’s elements shift position while it loads (Cumulative Layout Shift, CLS). These are some of the many factors. Other factors include:
- Responsiveness
- Speed
- Loading time
- SSL certificate
- Use of keywords in the URL
- Short URL
- Meta description
- Schema structure
- Image alt text
- Server response time
- 4XX and 5XX errors in website
- AMP availability and much more
This list can go on. What makes the entire ranking far more complicated (and better for the users) is that these factors not only directly affect ranking but a complicated interaction between these factors also affects it. This is the reason why Google is able to provide such accurate results.
In other words, ranking is the most important part of displaying a search result, and it is the part that requires the most processing.
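To make the page load metrics from this section (FCP, LCP, and CLS) a little more tangible, the sketch below sorts measured values into “good”, “needs improvement”, and “poor”. The thresholds are the ones Google’s web.dev guidance listed at the time of writing, to the best of our knowledge. Treat them as assumptions and check the current documentation before relying on them. The `rate` function and the sample measurements are made up.

```python
# (good_max, poor_min) per metric. Assumed from Google's public web.dev guidance;
# verify against the current documentation before using these numbers.
THRESHOLDS = {
    "FCP": (1.8, 3.0),   # First Contentful Paint, in seconds
    "LCP": (2.5, 4.0),   # Largest Contentful Paint, in seconds
    "CLS": (0.1, 0.25),  # Cumulative Layout Shift, unitless score
}


def rate(metric, value):
    """Classify a measured value as good, needs improvement, or poor."""
    good_max, poor_min = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value <= poor_min:
        return "needs improvement"
    return "poor"


# Hypothetical measurements for one page.
measurements = {"FCP": 2.5, "LCP": 2.1, "CLS": 0.3}
for metric, value in measurements.items():
    print(metric, value, "->", rate(metric, value))
```

How much weight these ratings carry relative to content quality and links is not public, so a “good” rating here should be read as removing a handicap rather than guaranteeing a ranking.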
Personalized Results
Google has ways to understand the intent of a search. It can tell whether you want to buy an apple or you want to buy an Apple product. But what about search results that are served only for you? When you search for something through Google’s search engine, a set of identifiers (cookies, cache, and others) comes into play to provide you with the most accurate and useful result, only for you.
So if you search for cake shops, Google will show you the nearest cake shops to you. Or if you search for a movie, it will show the timings and tickets available for the nearest theaters in your city. But this personalization goes even further. It is as if Google is inside your mind. Try doing this for yourself:
In your Google Chrome browser, just search for “Mount Everest”. After that, if you just type “how tall”, Google suggests the height of Mount Everest. This is not a feature that Google had from the beginning. This is the magic of its algorithms. So Google uses three major factors to provide a personalized result for you: your location, history, and language.
The location gives them data on which places and services will be recommended to you, while your history is their greatest tool to give you the most customized results. Google analyses all your previous website visits, searches, and such and tries to find the relation between your current search and previous ones. The language that you speak also enables them to show results in languages you’ll understand. This is how Google shows you personalized results.
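As a toy picture of personalization, the sketch below takes results that already have a general relevance score and nudges them up when they match the user’s city or language. Everything here is hypothetical: the `Result` fields, the boost values, and the sample data are invented, and real personalization (including the use of search history) is far more involved.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    base_score: float  # general relevance before personalization
    city: str
    language: str


def personalize(results, user_city, user_language, city_boost=0.3, language_boost=0.2):
    """Re-order results, boosting those that match the user's city or language."""

    def score(result):
        total = result.base_score
        if result.city == user_city:
            total += city_boost  # nearby places become more relevant
        if result.language == user_language:
            total += language_boost  # results the user can read rank higher
        return total

    return sorted(results, key=score, reverse=True)


# Made-up results for the query "cake shops".
RESULTS = [
    Result("Cake shop in Paris", 0.9, "Paris", "fr"),
    Result("Cake shop in Pune", 0.7, "Pune", "en"),
    Result("Cake shop in Delhi", 0.8, "Delhi", "en"),
]

personalized = personalize(RESULTS, user_city="Pune", user_language="en")
for result in personalized:
    print(result.title)
```

The same three results would come back in a different order for a French-speaking user in Paris, which is the whole point of personalization: one index, many different result pages.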
Indexed vs Non-Indexed pages: The Deep Web
Did you know that everything you see on Google, Yahoo, and most of the other popular search engines is often estimated to be only about 5% of the entire content of the internet? The remaining “unindexed” part of the web is called the deep web. Most of it is ordinary content that crawlers cannot reach, such as pages behind logins, private databases, and pages that site owners have excluded from indexing. The deep web should not be confused with the dark web, a small portion of it that can only be accessed through special software such as the Tor browser and where extra caution is advised.
The deep web is the collection of all the web pages that are not indexed by Google or any other popular search engine. Google stores the pages it has crawled, whether they were submitted or discovered through links, in its index. When a search query is made, Google’s algorithms look only through that index. All the non-indexed data is still there, on the web, but not on Google.
Conclusion
Search engines are complicated programs that are growing more and more complex as the amount of information on the internet is increasing and diversifying. With time, search engines will get more complex and smarter.
The processes that happen under the hood start from crawling the web pages, storing them in their database, and then executing millions of calculations in a fraction of a second to display the best results possible. Remember that almost all search engines have these three steps as their core functionality, with ranking being the most important part.
In simple terms, search engines have a repository of information in which every entry is tagged with particular keywords. When someone searches for a keyword, the algorithms do a quick lookup and serve the best result for the query.
While this explanation is simple, the actual process that happens, as we explained, is more complex than one can imagine. Searching for something is easy, but finding it is the difficult part. This was all on how search engines work.