5 Python Scripts for Automating SEO Tasks

Published: 2023-04-12

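Python is a powerful programming language that has gained popularity in the SEO industry over the past few years.

With its relatively simple syntax, efficient performance, and abundance of libraries and frameworks, Python has changed how many SEOs approach their work.

Python offers a versatile toolset that can make the optimization process faster, more accurate, and more effective.

This article explores five Python scripts that can help boost your SEO efforts:

Automate a redirect map.
Write meta descriptions in bulk.
Analyze keywords with N-grams.
Group keywords into topic clusters.
Match a keyword list to a list of predefined topics.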

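The easiest way to get started with Python

If you want to dip your toes into Python programming, Google Colab is worth considering.

It is a free, web-based platform for writing and running Python code, with no complex local setup required.

Essentially, it lets you access Jupyter Notebooks in your browser and comes with many preinstalled libraries for data science and machine learning.

It is also built on top of Google Drive, so you can easily save your work and share it with others.

To get started, follow these steps:

Enable file uploads

Once you open Google Colab, you first need to enable the temporary file storage. It is as simple as clicking the folder icon.

This lets you upload temporary files and then download any result files.

Upload source data

Many of our Python scripts require a source file to work. To upload a file, simply click the upload button.

Once you finish the setup, you can start testing the following Python scripts.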
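Before running a script, it can help to confirm that an uploaded file reads the way you expect. The following is a minimal sketch, not part of the original scripts: the helper name load_lines and the file name sample_keywords.txt are assumptions made for illustration. It reads a text file with one entry per line and skips blank lines.

```python
from pathlib import Path


def load_lines(path):
    """Return the non-empty, stripped lines of a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Write a throwaway sample file so the sketch runs anywhere (hypothetical name)
sample = Path("sample_keywords.txt")
sample.write_text("running shoes\n\n  trail running shoes  \n", encoding="utf-8")

print(load_lines(sample))  # ['running shoes', 'trail running shoes']
```

In Colab, you would point load_lines at the file you uploaded instead of the sample file.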

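Script 1: Automate a redirect map

Creating a redirect map for a large site can be very time-consuming. Finding ways to automate the process can help us save time and focus on other tasks.

How this script works

This script analyzes web content to find closely matching articles.

First, it imports URLs from two TXT files: one for the site being redirected (source_urls.txt) and one for the site absorbing the redirects (target_urls.txt).
Then, it uses the Python library Beautiful Soup to build a web scraper that fetches the body content of each page. The script keeps only paragraph text, which skips most header and footer content.
After scraping the content of all pages, it uses the Python library PolyFuzz to match content between URLs and give each match a similarity score.
Finally, it writes the results to a CSV file, including the similarity score.

From there, you can manually review any URLs with a low similarity score to find the next closest match.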
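The matching and review steps can be illustrated without any network access. This is a simplified sketch under stated assumptions: it swaps in difflib.SequenceMatcher from the standard library as a stand-in for PolyFuzz, the URLs and page text are made up, and the 0.6 review threshold is an arbitrary value you would tune for your own content.

```python
from difflib import SequenceMatcher


def best_match(source_text, targets):
    """Return (target_url, score) for the target whose text is most similar."""
    scored = [
        (url, SequenceMatcher(None, source_text, text).ratio())
        for url, text in targets.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical scraped text keyed by URL
source_pages = {
    "https://old.example.com/red-shoes": "red running shoes for trail runners",
    "https://old.example.com/contact": "contact our support team",
}
target_pages = {
    "https://new.example.com/shoes/red": "red running shoes for trail and road runners",
    "https://new.example.com/about": "about our company history",
}

REVIEW_THRESHOLD = 0.6  # assumed cut-off, not a recommended value

redirect_map = []
for url, text in source_pages.items():
    target, score = best_match(text, target_pages)
    needs_review = score < REVIEW_THRESHOLD
    redirect_map.append((url, target, round(score, 2), needs_review))

for row in redirect_map:
    print(row)
```

Rows flagged for review are the ones worth checking by hand, since the best available target may still be a poor fit.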

Get the script

# install PolyFuzz first if it is missing (in Colab: !pip install polyfuzz)
# import libraries
from bs4 import BeautifulSoup, SoupStrainer
from polyfuzz import PolyFuzz
import concurrent.futures
import csv
import requests

# import urls, skipping blank lines
with open("source_urls.txt", "r") as file:
    url_list_a = [line.strip() for line in file if line.strip()]

with open("target_urls.txt", "r") as file:
    url_list_b = [line.strip() for line in file if line.strip()]

# create a content scraper via bs4 that keeps only paragraph text
def get_content(url_argument):
    page_source = requests.get(url_argument, timeout=30).text
    strainer = SoupStrainer("p")
    soup = BeautifulSoup(page_source, "lxml", parse_only=strainer)
    paragraph_list = [element.text for element in soup.find_all("p")]
    content = " ".join(paragraph_list)
    return content

# scrape the urls for content
with concurrent.futures.ThreadPoolExecutor() as executor:
    content_list_a = list(executor.map(get_content, url_list_a))
    content_list_b = list(executor.map(get_content, url_list_b))

# look up a target url by its scraped content
content_dictionary = dict(zip(content_list_b, url_list_b))

# get content similarities via polyfuzz
model = PolyFuzz("TF-IDF")
model.match(content_list_a, content_list_b)
data = model.get_matches()

# map similarity data back to urls (None when no target was matched)
result = [content_dictionary.get(content) for content in data["To"]]

# export to a spreadsheet
to_zip = list(zip(url_list_a, result, data["Similarity"]))
with open("redirect_map.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["From URL", "To URL", "% Identical"])
    writer.writerows(to_zip)

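Script 2: Write meta descriptions in bulk

While meta descriptions are not a direct ranking factor, they can help us improve organic click-through rates. Leaving a meta description blank increases the chance that Google will create its own.

If your SEO audit shows a large number of URLs missing a meta description, it may be hard to find the time to write them all by hand, especially for ecommerce sites.

This script is designed to save you time by automating the process for you.

How the script works

First, the script imports a list of URLs from a TXT file (urls.txt).
Then, it parses all the content on each URL.
Once the content is parsed, it creates meta descriptions that are no longer than 155 characters.
It exports the results to a CSV file.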
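The length limit in the steps above can be applied in more than one way. As a sketch only, the hypothetical helper below (truncate_description is not part of the original script) trims a summary to the 155-character limit but cuts at a word boundary, so the description does not end in the middle of a word.

```python
def truncate_description(text, limit=155):
    """Shorten text to at most `limit` characters, cutting at a word boundary."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    if len(text) <= limit:
        return text
    cut = text[: limit - 3]  # leave room for the trailing ellipsis
    # If the cut landed inside a word, drop that partial word
    if text[limit - 3] != " " and " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:.") + "..."


short = "Shop trail running shoes with free returns."
long_text = "word " * 50

print(truncate_description(short))
print(len(truncate_description(long_text)))
```

Short text passes through unchanged, and longer text always comes back at or under the limit.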

Get the script

!pip install sumy

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
import csv

# Note: sumy's tokenizer relies on NLTK data. If it reports missing data,
# try running: import nltk; nltk.download("punkt")

# 1) imports a list of URLs from a txt file, skipping blank lines
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# the stemmer and summarizer can be reused for every URL
stemmer = Stemmer("english")
summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words("english")

results = []

# 2) analyzes the content on each URL
for url in urls:
    parser = HtmlParser.from_url(url, Tokenizer("english"))

    # 3) summarizes the page in three sentences and trims it to 155 characters
    sentences = summarizer(parser.document, 3)
    description = " ".join(str(sentence) for sentence in sentences)
    if len(description) > 155:
        description = description[:152] + "..."

    results.append({"url": url, "description": description})

# 4) exports the results to a csv file
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "description"])
    writer.writeheader()
    writer.writerows(results)

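Script 3: Analyze keywords with N-grams

N-grams are not a new concept, but they are still useful for SEO. They can help us understand themes across large sets of keyword data.

How this script works

This script outputs the results to a TXT file that breaks the keywords down into unigrams, bigrams, and trigrams.

First, it imports a TXT file containing all your keywords (keywords.txt).
Then it uses Counter, a class from Python's built-in collections module, to analyze and extract the N-grams.
Then it exports the results to a new TXT file.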
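To make the terms concrete, here is a small sketch of how a single keyword breaks down into unigrams, bigrams, and trigrams. The helper name ngrams and the sample keyword are assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "best trail running shoes".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
# 1 ['best', 'trail', 'running', 'shoes']
# 2 ['best trail', 'trail running', 'running shoes']
# 3 ['best trail running', 'trail running shoes']
```

Counting how often each of these fragments appears across thousands of keywords is what surfaces the recurring themes.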

Get this script

# Import necessary libraries
import re
from collections import Counter

# Open the text file and read one keyword per line
with open("keywords.txt", "r") as f:
    keywords = f.read().splitlines()

# Initialize counters for storing the unigrams, bigrams, and trigrams
unigrams = Counter()
bigrams = Counter()
trigrams = Counter()

# Count N-grams within each keyword so they never span two keywords
for keyword in keywords:
    # Use a regular expression to remove any non-alphabetic characters from the words
    words = [re.sub(r"[^a-zA-Z]", "", word) for word in keyword.split()]
    words = [word for word in words if word]  # drop words left empty by the cleanup

    unigrams.update(words)
    bigrams.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    trigrams.update(" ".join(words[i:i + 3]) for i in range(len(words) - 2))

# Write the ten most common N-grams of each size to a text file
sections = (("unigrams", unigrams), ("bigrams", bigrams), ("trigrams", trigrams))
with open("results.txt", "w") as f:
    for title, counts in sections:
        f.write(f"Most common {title}:\n")
        for ngram, count in counts.most_common(10):
            f.write(f"{ngram}: {count}\n")
        f.write("\n")

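Script 4: Group keywords into topic clusters

Keyword research is always an early step in a new SEO project. Sometimes we are dealing with thousands of keywords in a single dataset, which makes grouping them challenging.

Python lets us automatically cluster keywords into similar groups to identify trends and complete our keyword mapping.

How this script works

This script first imports a TXT file of keywords (keywords.txt).
Then it analyzes the keywords using TfidfVectorizer and AffinityPropagation.
It then assigns a numeric value to each topic cluster.
The results are then exported to a CSV file.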
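Why keywords that share words end up in the same cluster is easier to see with a small worked example. The sketch below is a simplified, standard-library illustration of TF-IDF weighting and cosine similarity. It uses the smoothed IDF formula that scikit-learn applies by default, but its whitespace tokenizer is a simplification and it is not a replacement for TfidfVectorizer. The function names and sample keywords are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


keywords = ["running shoes", "trail running shoes", "seo audit checklist"]
vecs = tfidf_vectors(keywords)

print(round(cosine(vecs[0], vecs[1]), 2))  # high: the keywords share two words
print(cosine(vecs[0], vecs[2]))            # 0.0: no words in common
```

A clustering algorithm such as AffinityPropagation works from similarities like these, which is why the first two keywords would tend to land in the same group.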

Get this script

import csv
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

# Read keywords from text file, skipping blank lines
with open("keywords.txt", "r") as f:
    keywords = [line.strip() for line in f if line.strip()]

# Create a Tf-idf representation of the keywords
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keywords)

# Perform Affinity Propagation clustering
af = AffinityPropagation().fit(X)
labels = af.labels_  # one cluster number per keyword

# Write the clusters to a csv file, grouped by cluster number.
# A label of -1 indicates that the algorithm did not converge.
with open("clusters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Cluster", "Keyword"])
    for label, keyword in sorted(zip(labels, keywords), key=lambda pair: pair[0]):
        writer.writerow([label, keyword])

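Script 5: Match a keyword list to a list of predefined topics

This is similar to the previous script, except that it lets you match a list of keywords to a set of predefined topics.

This is useful for large sets of keywords because it processes them in batches of 1,000 to prevent your system from crashing.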
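The batching idea can be sketched in a few lines. This is an illustration only: the generator name batches and the generated keyword list are assumptions, and the batch size of 1,000 simply mirrors the figure mentioned above.

```python
def batches(items, size=1000):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


keywords = [f"keyword {i}" for i in range(2500)]

print([len(batch) for batch in batches(keywords)])  # [1000, 1000, 500]
```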

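How this script works

This script imports a keyword list (keywords.txt) and a topics list (topics.txt).
Then it analyzes the topics and keyword lists and matches each keyword to its closest topic. If no match is found, the keyword is categorized as Other.
The results are then exported to a CSV file.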
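The matching rule in the steps above comes down to counting shared words. Here is a minimal standard-library sketch of that rule, assuming a tiny hand-written stop-word list in place of a full language model. The function names, topics, and keywords are made up for illustration.

```python
# Tiny illustrative stop-word list, far smaller than a real one
STOP_WORDS = {"a", "an", "the", "for", "in", "of", "to", "and"}


def tokens(text):
    """Lowercase the text and return its words as a set, minus stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def closest_topic(keyword, topics, fallback="Other"):
    """Return the topic sharing the most words with the keyword."""
    best_topic, best_overlap = fallback, 0
    keyword_tokens = tokens(keyword)
    for topic in topics:
        overlap = len(keyword_tokens & tokens(topic))
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


topics = ["running shoes", "hiking boots"]

for keyword in ["best running shoes for women", "waterproof hiking boots", "yoga mat"]:
    print(keyword, "->", closest_topic(keyword, topics))
# best running shoes for women -> running shoes
# waterproof hiking boots -> hiking boots
# yoga mat -> Other
```

Because the rule only counts exact word matches, a keyword such as "sneakers" would fall into Other even though it relates to running shoes, so it is worth scanning the Other group by hand.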

Get this script

import pandas as pd
import spacy

# Load the spaCy English language model
# (if it is missing, install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Define the batch size for keyword analysis
BATCH_SIZE = 1000

# Load the keywords and topics files, one entry per line
with open("keywords.txt", "r") as f:
    keywords = [line.strip() for line in f if line.strip()]

with open("topics.txt", "r") as f:
    topics = [line.strip() for line in f if line.strip()]

# Tokenize text and remove stop words and punctuation
def get_tokens(text):
    return {token.text for token in nlp(text.lower())
            if not token.is_stop and not token.is_punct}

# Tokenize each topic once instead of once per keyword
topic_tokens = {topic: get_tokens(topic) for topic in topics}

# Define a function to categorize a keyword based on the closest related topic
def categorize_keyword(keyword):
    tokens = get_tokens(keyword)

    # Find the topic that has the most token overlaps with the keyword
    max_overlap = 0
    best_topic = "Other"
    for topic, candidate_tokens in topic_tokens.items():
        overlap = len(tokens & candidate_tokens)
        if overlap > max_overlap:
            max_overlap = overlap
            best_topic = topic

    return best_topic

# Define a function to process a batch of keywords and return the results as a dataframe
def process_keyword_batch(keyword_batch):
    results = [{"keyword": keyword, "category": categorize_keyword(keyword)}
               for keyword in keyword_batch]
    return pd.DataFrame(results, columns=["keyword", "category"])

# Process the keywords in batches
batch_frames = []
for i in range(0, len(keywords), BATCH_SIZE):
    batch_frames.append(process_keyword_batch(keywords[i:i + BATCH_SIZE]))

# Combine the batches and export the results to a CSV file
if batch_frames:
    results_df = pd.concat(batch_frames, ignore_index=True)
else:
    results_df = pd.DataFrame(columns=["keyword", "category"])
results_df.to_csv("results.csv", index=False)

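Using Python for SEO

Python is an incredibly powerful and versatile tool for SEO professionals.

Whether you are a beginner or an experienced practitioner, the free scripts I have shared in this article are a great starting point for exploring what Python can do for SEO.

With its intuitive syntax and vast array of libraries, Python can help you automate tedious tasks, analyze complex data, and gain new insights into your website's performance. So why not give it a try?

Good luck, and happy coding!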

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.
