Finding My Niche
This week we take a look at brainstorming a newsletter niche as a JupyterLab exercise.
Getting Informed
If feedback is a gift, then reader feedback is a gift that keeps on giving. This week, I received feedback on the newsletter.
“You are all over the place. Instead of the map of Jay’s mind, why not pick a niche topic and go very, very deep?”
This reader feedback got me thinking. What niche topic could I go very deep on while keeping up my regular pace of newsletter updates?
Where do I start?
Could my prior writing have a theme that isn’t obvious to me? What are the patterns buried within my ramblings?
I have posts in markdown format, but my tagging has been lagging. If only there were a way to perform analysis on all my posts…
Oh, that’s right. I can open JupyterLab and scratch this itch in a few minutes.
➜ ~ git:(main) ✗ /opt/homebrew/opt/jupyterlab/bin/jupyter-lab
Side note: If you want to try this yourself, I’m adding these scripts to the JupyterLab folder of my buttondown-python-scripts repository: https://github.com/JayCuthrell/buttondown-python-scripts
Goals
The script needs to be simple enough for me to grasp. I used Google Gemini to help me make better mistakes. 🤣
- Analyze markdown files (.md) by extracting the text, cleaning it up, and identifying the most and least common words, two-word phrases (bigrams), and three-word phrases (trigrams) to help understand the key topics and language patterns.
- Use mature and documented libraries such as glob, markdown, re, nltk, collections.Counter, nltk.tokenize, nltk.corpus, nltk.util, and bs4, along with the required language resources (tokenizers, stop words, formatting, etc.).
- For each file, convert markdown to plain text, then use BeautifulSoup to strip out HTML code, remove punctuation, convert everything to lowercase, and tokenize.
- Apply filters as lists to keep only the words that are not in the filter list (i.e., things like numbers and weasel words).
- Gather word frequencies and show the top results: the 5 least common trigrams, the 25 most common trigrams, the 25 most common words, and the 25 most common bigrams, along with their counts.
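If bigrams and trigrams are new to you, here is a minimal sketch (a toy example of my own, not part of the analysis script) of what nltk.util.ngrams produces:

from nltk.util import ngrams

# A made-up token list for illustration
tokens = ["edge", "core", "cloud", "providers"]
print(list(ngrams(tokens, 2)))  # [('edge', 'core'), ('core', 'cloud'), ('cloud', 'providers')]
print(list(ngrams(tokens, 3)))  # [('edge', 'core', 'cloud'), ('core', 'cloud', 'providers')]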
Set up the environment
pip install --upgrade pip
pip install pandas
pip install requests
pip install markdown
pip install textblob
pip install beautifulsoup4
pip install nltk
pip install gensim
pip install spacy
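Of these, the script below only imports markdown, beautifulsoup4 (as bs4), and nltk; the rest were along for the ride while I experimented. Here is a quick sanity check of my own, using nothing beyond the standard library, to confirm the essentials are installed:

import importlib.util

# Confirm the three third-party libraries the script imports are available
for module in ("markdown", "bs4", "nltk"):
    print(module, "OK" if importlib.util.find_spec(module) else "MISSING")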
Customizing as I go on each run
- Run the script and update filters by appending to weasel_words (one way to persist these between runs is sketched after this list)
- Update the other filters to get rid of non-topical words like time references
- Alter the numerical values for returned results
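One way to make that append-and-rerun loop less tedious (a sketch on my part, assuming a hypothetical filter_words.txt file next to the notebook) is to keep the filter terms in a plain text file and load them at the top of the script:

from pathlib import Path

# Hypothetical file: one filter word per line, easy to append to between runs
FILTER_FILE = Path("filter_words.txt")

def load_filter_words(path=FILTER_FILE):
    """Return a set of lowercase filter words, or an empty set if the file doesn't exist yet."""
    if path.exists():
        return {line.strip().lower() for line in path.read_text().splitlines() if line.strip()}
    return set()

weasel_words = load_filter_words()  # would replace the hard-coded set below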
import glob
import markdown
import re
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from bs4 import BeautifulSoup
# Download NLTK data if needed
nltk.download('punkt') # For tokenization
nltk.download('stopwords') # For stop word removal
nltk.download('averaged_perceptron_tagger')
# 1. Define Folder Path and Load Markdown Files
folder_path = '/path/to/my/blog/posts/*.md' # Your folder path
markdown_files = glob.glob(folder_path)
all_text = ""
for file_path in markdown_files:
    with open(file_path, 'r') as file:
        md_text = file.read()
    plain_text = markdown.markdown(md_text)
    # Remove HTML tags
    soup = BeautifulSoup(plain_text, 'html.parser')
    plain_text = soup.get_text()
    all_text += plain_text + " "
# 2. Clean and Preprocess Text
all_text = re.sub(r'[^\w\s]', '', all_text).lower()
words = word_tokenize(all_text)
# Extended list of words to filter out
weasel_words = {"subscribe", "linkedin", "people", "good", "email",
"shot"} # Add your words
days_of_week = {"timeline", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
months = {"january", "february", "march", "april", "may", "june", "july", "august", "september",
"october", "november", "december", "jan", "feb", "mar", "apr", "may", "jun", "jul",
"aug", "sep", "oct", "nov", "dec"} # Added abbreviations
week_variations = {"week", "weeks", "weekly"}
words_to_filter = set(stopwords.words('english'))
words_to_filter.update(weasel_words, days_of_week, months, week_variations)
# Filter out words that are numbers
filtered_words = [word for word in words if word not in words_to_filter and not word.isdigit()]
# 3. Analyze for Most Common Words, Bigrams, and Trigrams
word_counts = Counter(filtered_words)
top_words = word_counts.most_common(25)
bigrams = ngrams(filtered_words, 2)
bigram_counts = Counter(bigrams)
top_bigrams = bigram_counts.most_common(25)
trigrams = ngrams(filtered_words, 3)
trigram_counts = Counter(trigrams)
top_trigrams = trigram_counts.most_common(25)
# Get the 5 least common trigrams
least_common_trigrams = trigram_counts.most_common()[-5:]
# Note: most_common() sorts by descending count, so the last five entries are the least common.
print("\nTop 5 Least Common Three-Word Phrases:")
for trigram, count in least_common_trigrams:
print(f"{trigram}: {count}")
print("\nTop 25 Most Common Three-Word Phrases:")
for trigram, count in top_trigrams:
print(f"{trigram}: {count}")
print("\nTop 25 Most Common Words:")
for word, count in top_words:
print(f"{word}: {count}")
print("\nTop 25 Most Common Two-Word Phrases:")
for bigram, count in top_bigrams:
print(f"{bigram}: {count}")
And the results…
Top 5 Least Common Three-Word Phrases:
('fighting', 'check', 'innovation'): 1
('check', 'innovation', 'customer'): 1
('innovation', 'customer', 'rfid'): 1
('customer', 'rfid', 'suppliers'): 1
('rfid', 'suppliers', 'sector'): 1
Top 25 Most Common Three-Word Phrases:
('edge', 'core', 'cloud'): 26
('hyperscale', 'cloud', 'providers'): 25
('internal', 'developer', 'platform'): 22
('low', 'code', 'code'): 17
('cloud', 'data', 'services'): 15
('developer', 'platform', 'idp'): 12
('business', 'value', 'engineering'): 11
('gartner', 'hype', 'cycle'): 11
('developer', 'experience', 'devx'): 11
('software', 'supply', 'chain'): 11
('site', 'reliability', 'engineering'): 10
('cloud', 'status', 'dashboards'): 10
('status', 'dashboards', 'cloud'): 10
('oracle', 'cloud', 'infrastructure'): 10
('silo', 'spreadsheet', 'sprawl'): 10
('supply', 'chain', 'security'): 10
('aws', 'azure', 'gcp'): 9
('google', 'cloud', 'platform'): 9
('multicloud', 'data', 'services'): 9
('cloud', 'engineering', 'slo'): 8
('oran', 'vran', '5g'): 8
('cloud', 'impact', 'mapping'): 8
('cloud', 'mean', 'rca'): 8
('hyperscale', 'cloud', 'provider'): 8
('hype', 'cycle', 'emerging'): 8
Top 25 Most Common Words:
cloud: 562
data: 502
ai: 376
platform: 231
google: 229
technology: 212
engineering: 210
software: 208
services: 207
experience: 166
internet: 158
developer: 150
access: 148
business: 139
code: 138
market: 135
computing: 135
company: 132
network: 131
security: 130
product: 130
tools: 123
looking: 123
tech: 123
technologies: 122
Top 25 Most Common Two-Word Phrases:
('platform', 'engineering'): 83
('quantum', 'computing'): 48
('cloud', 'providers'): 43
('machine', 'learning'): 41
('zero', 'trust'): 40
('low', 'code'): 37
('developer', 'experience'): 36
('data', 'centers'): 36
('hyperscale', 'cloud'): 34
('cloud', 'provider'): 31
('google', 'cloud'): 31
('internal', 'developer'): 30
('generative', 'ai'): 29
('data', 'services'): 28
('edge', 'core'): 28
('supply', 'chain'): 27
('core', 'cloud'): 26
('impact', 'mapping'): 26
('status', 'dashboards'): 25
('ai', 'ml'): 24
('mean', 'rca'): 24
('mergers', 'acquisitions'): 24
('artificial', 'intelligence'): 22
('oracle', 'cloud'): 22
('developer', 'platform'): 22
Have I found my niche?
Based on these swirling telecom, cloud, and AI themes, I’m leaning toward a newsletter niche with a consistent format that tracks where the edge and machine learning meet. More to come, but that’s my thinking as of this edition.
Stay tuned…
Disclosure
I am linking to my disclosure.