Analyze Google SERP Snippets with Python

“Google rewrites 62% of Page Title Tags.”

According to a recent study by Zyppy when analyzing more than 80,000 title tags.

In response to Google rewriting title tags and meta descriptions, I’ve been using a combination of Python modules and SERP API’s to show clients what’s actually being shown on Google for their pages.

And many times we are very surprised by what Google decides to show in the SERP…

Seeing how pages are actually appearing in the Google SERP is a good way of analyzing potential optimization opportunities and can help ask the following questions:

  • Is Google rewriting my metadata?
  • How closely related is my predefined metadata to what’s showing in the SERP?
  • What content is Google using to rewrite my metadata?
  • Is there opportunity to optimize metadata to better match search queries?

Prerequisites

In order to utilize this script for your own SEO goals you will need:

  • An intermediate understanding of Python and API requests.
  • A paid or trial plan with a SERP scraper API.
  • An indexed website to analyze.

Python Modules

For comparing metadata to Google SERP snippets, I first use Beautiful Soup to parse HTML documents and gather pre-defined metadata.

To gather SERP snippet data I use a SERP API such as Oxylabs or Scale SERP.

For comparing SERP snippets to metadata I leverage FuzzyWuzzy string matching. FuzzyWuzzy uses Levenshtein Distance to calculate a similarity index (from 0-100, 100 being identically matched and 0 being no similarity).

And lastly I utilize openpyxl to export our data to an easy to read Excel spreadsheet.

# import the following modules

import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import date
from fuzzywuzzy import fuzz
from openpyxl import Workbook

# Use SERP API key and SERP Query URL as env variables

API_KEY = os.environ['SCALE_SERP_KEY']
QUERY_URL = os.environ['SCALE_SERP_QUERY_URL']

Import Data to Python

Gather a list of URLs you’d like to analyze and make sure the URLs are listed in the first column of your file.

I am using an Excel (xlsx) file in my current working directory to easily convert our spreadsheet full of URLs to a python list.

# The name of the excel file containing a list of URLs

url_file = 'file_urls.xlsx'

data = pd.read_excel(urls, sheet_name=0)

urls_list = [url for url in data[data.columns[0]]]

Scrape HTML Document for metadata

In the first portion of this script, our for loop will be making an HTTP request to all URLs in our list using the requests module (I’m using an Android Mobile user agent string, feel free to use others). We then use a response method to return the actual content of the HTTPS response, and on our way to scraping we go.

Beautiful Soup then begins it’s HTML parsing, and extracts the title tag text and meta description content. If the site being analyzed is a JavaScript website, then Beautiful Soup will be unable to gather the metadata information.

output_list = {}
counter = 0
row = 2

for url in urls_list:
    print(f'Analyzing :: {url}')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 12; SM-S906N Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.119 Mobile Safari/537.36'
    }
    soup_response = requests.get(url, headers=headers)
    status = soup_response.status_code
    analyzed_url = soup_response.text
    try:
        b_soup = BeautifulSoup(analyzed_url, 'html.parser')
        soup_title = b_soup.title.text
        soup_description = b_soup.find(name='meta', attrs={'name': 'description'})['content']

    except TypeError:
        soup_title = "Error, title cannot be retrieved"
        soup_description = "Error, meta description cannot be retrieved"

Request SERP Snippets

Within our same for loop as the step above, a simple request is made to the SERP API. The API response returns useful SERP data, such as the title link text and description, that will be used to compare against each URLs pre-defined metadata.

Some URLs within your list may not be indexed, in that case I use python exception handling to label non-indexed URLs accordingly.

    params = {
        'api_key': API_KEY,
        'q': f'site:{url}',
    }

    serp_response = requests.get(
        QUERY_URL,
        params=params,
    )


    try:
        serp_description = serp_response.json()["organic_results"][0]["snippet"]
        serp_title = serp_response.json()["organic_results"][0]["title"]
        query_displayed = serp_response.json()["search_information"]["query_displayed"]

    except KeyError:
        print(f"{url} is currently not indexed")

Compare Results Using FuzzyWuzzy

One of the most straightforward methods of comparing strings in python is through the FuzzyWuzzy library. Using a score out of 100, FuzzyWuzzy determines a similarity between the URLs pre-defined title and SERP title, and again determines a similarity score between the URLs pre-defined meta description and SERP description.

A score of 100 is given when two strings are identical, while two strings become less similar as they approach 0.

If a URL is not indexed, the match score for the title and description default to 0 because we are not able to perform string matching.

# continues within the original for loop   

 if 'Error' in soup_title and soup_description:
        title_compare = 0
        desc_compare = 0
    else:
        title_compare = fuzz.ratio(soup_title, serp_title)
        desc_compare = fuzz.ratio(soup_description, serp_description)

    output_list[str(url)] = [
        status,     
        soup_title,
        serp_title,
        title_compare,
        soup_description,
        serp_description,
        desc_compare
    ]

    counter += 1
    print(f"progress:  {str(counter)} of {len(urls_list)}")

Export Results to Excel

Once the for loop has completed its iterations, we then leverage Openpyxl to format our dictionary to an Excel file.

The datatime module is used to save our file using the current date (%Y-%m-%d) for simple file management.

wb = Workbook()
ws = wb.active

file_name = "serp_metadata_analysis"
dest_file = f'{file_name}{today}.xlsx'

for key, values in output_list.items():

    ws.cell(row=1, column=1).value = 'URL'
    ws.cell(row=1, column=2).value = 'Status Code'
    ws.cell(row=1, column=3).value = 'Title'
    ws.cell(row=1, column=4).value = 'SERP Title'
    ws.cell(row=1, column=5).value = 'Title Compare'
    ws.cell(row=1, column=6).value = 'Meta Description'
    ws.cell(row=1, column=7).value = 'SERP Description'
    ws.cell(row=1, column=8).value = 'Desc. Compare'

    ws.cell(row=row, column=1).value = key
    column = 2
    for value in values:
        ws.cell(row=row, column=column).value = value
        column += 1
    row += 1

Try the Micro Python App!

Try the micro python app below to see the script work for a single URL. Feel free to explore the Github Repo, recommendations are always welcome!


Posted

in

by

Discover more from Tyler Gargula

Subscribe now to keep reading and get access to the full archive.

Continue reading