Finding My Niche
This week we take a look at newsletter specialization ideation as a JupyterLab exercise.
Getting Informed
If feedback is a gift, then reader feedback is a gift that keeps on giving. This week, I received feedback on the newsletter.
You are all over the place. Instead of the map of Jay's mind, why not pick a niche topic and go very very deep?
This reader feedback got me thinking. What niche topic could I go very deep on while keeping up my regular pace of newsletter updates?
Where do I start?
Could my prior writing have a theme that isn't obvious to me? What are the patterns buried within my ramblings?
I have posts in markdown format, but my tagging has been lagging. If only there were a way to perform analysis on all my posts…
Oh, that's right. I can open JupyterLab and scratch this itch in a few minutes.
➜ ~ git:(main) ✗ /opt/homebrew/opt/jupyterlab/bin/jupyter-lab
Side note: If you want to try this yourself, I'm adding to my buttondown-python-scripts in the JupyterLab folder.
https://github.com/JayCuthrell/buttondown-python-scripts
Goals
The script needs to be simple enough for me to grasp. I used Google Gemini to help me make better mistakes. 🤣
- Analyze markdown files (.md) by extracting the text, cleaning it up, and identifying the most and least common words, two-word phrases (bigrams), and three-word phrases (trigrams) to help understand the key topics and language patterns (see the short illustration after this list).
- Use mature and documented libraries such as glob, markdown, re, nltk, collections.Counter, nltk.tokenize, nltk.corpus, nltk.util, and bs4 for the required language resources (tokenizers, stop words, formatting, etc.).
- For each file, convert markdown to plain text, then use BeautifulSoup to strip out HTML, remove punctuation, convert everything to lowercase, and tokenize.
- Apply filters as lists to keep only the words that are not in the filter list (i.e. things like numbers and weasel words).
- Gather word frequencies and show the top results, such as the 5 least common trigrams, 25 most common trigrams, 25 most common words, and 25 most common bigrams, along with their counts.
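If bigrams and trigrams are new to you, here is a minimal sketch of what nltk.util.ngrams produces from a list of tokens (the sample tokens are just for illustration):
from nltk.util import ngrams

tokens = ["edge", "core", "cloud", "providers"]
print(list(ngrams(tokens, 2)))  # [('edge', 'core'), ('core', 'cloud'), ('cloud', 'providers')]
print(list(ngrams(tokens, 3)))  # [('edge', 'core', 'cloud'), ('core', 'cloud', 'providers')]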
Set up the environment
pip install --upgrade pip
pip install pandas
pip install requests
pip install markdown
pip install textblob
pip install beautifulsoup4
pip install nltk
pip install gensim
pip install spacy
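If you would rather stay inside the notebook, the same installs can be run from a JupyterLab cell with the built-in %pip magic (one possible shortcut, collapsing the list above into a single line):
# Run in a notebook cell; %pip installs into the active kernel's environment
%pip install --upgrade pip
%pip install pandas requests markdown textblob beautifulsoup4 nltk gensim spacy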
Customizing as I go on each run
- Run the script and update filters by appending to weasel_words (see the sketch after this list)
- Update the other filters to get rid of non-topical words like time references
- Alter the numerical values for returned results
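Here is what that tweaking loop might look like between runs; the extra filler words below are hypothetical placeholders, not terms from my actual posts:
# Append newly spotted filler words before the next run (example words are hypothetical)
weasel_words.update({"things", "stuff", "really"})
# Then re-run the counting cells, or adjust most_common(25) and the [-5:] slice to taste.
With that loop in mind, here is the full script.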
import glob
import markdown
import re
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from bs4 import BeautifulSoup
# Download NLTK data if needed
nltk.download('punkt') # For tokenization
nltk.download('stopwords') # For stop word removal
nltk.download('averaged_perceptron_tagger') # For part-of-speech tagging (not used below, but harmless)
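# Note: newer NLTK releases (3.9+) store the tokenizer data as 'punkt_tab';
# if word_tokenize raises a LookupError later, this extra download may help.
# nltk.download('punkt_tab')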
# 1. Define Folder Path and Load Markdown Files
folder_path = '/path/to/my/blog/posts/*.md' # Your folder path
markdown_files = glob.glob(folder_path)
all_text = ""
for file_path in markdown_files:
    with open(file_path, 'r', encoding='utf-8') as file:  # utf-8 avoids platform-default surprises
        md_text = file.read()
    plain_text = markdown.markdown(md_text)
    # Remove HTML tags produced by the markdown conversion
    soup = BeautifulSoup(plain_text, 'html.parser')
    plain_text = soup.get_text()
    all_text += plain_text + " "
# 2. Clean and Preprocess Text
all_text = re.sub(r'[^\w\s]', '', all_text).lower()
words = word_tokenize(all_text)
# Extended list of words to filter out
weasel_words = {"subscribe", "linkedin", "people", "good", "email",
                "shot"}  # Add your words
days_of_week = {"timeline", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
months = {"january", "february", "march", "april", "may", "june", "july", "august", "september",
          "october", "november", "december", "jan", "feb", "mar", "apr", "may", "jun", "jul",
          "aug", "sep", "oct", "nov", "dec"}  # Added abbreviations
week_variations = {"week", "weeks", "weekly"}
words_to_filter = set(stopwords.words('english'))
words_to_filter.update(weasel_words, days_of_week, months, week_variations)
# Drop anything in the filter lists, plus bare numbers
filtered_words = [word for word in words if word not in words_to_filter and not word.isdigit()]
# 3. Analyze for Most Common Words, Bigrams, and Trigrams
word_counts = Counter(filtered_words)
top_words = word_counts.most_common(25)
bigrams = ngrams(filtered_words, 2)
bigram_counts = Counter(bigrams)
top_bigrams = bigram_counts.most_common(25)
trigrams = ngrams(filtered_words, 3)
trigram_counts = Counter(trigrams)
top_trigrams = trigram_counts.most_common(25)
# Get the 5 least common trigrams
least_common_trigrams = trigram_counts.most_common()[-5:]
# Note: most_common() sorts from most to least frequent, so the last 5 entries are the least common.
print("\n5 Least Common Three-Word Phrases:")
for trigram, count in least_common_trigrams:
    print(f"{trigram}: {count}")

print("\nTop 25 Most Common Three-Word Phrases:")
for trigram, count in top_trigrams:
    print(f"{trigram}: {count}")

print("\nTop 25 Most Common Words:")
for word, count in top_words:
    print(f"{word}: {count}")

print("\nTop 25 Most Common Two-Word Phrases:")
for bigram, count in top_bigrams:
    print(f"{bigram}: {count}")
And the results…
5 Least Common Three-Word Phrases:
('fighting', 'check', 'innovation'): 1
('check', 'innovation', 'customer'): 1
('innovation', 'customer', 'rfid'): 1
('customer', 'rfid', 'suppliers'): 1
('rfid', 'suppliers', 'sector'): 1
Top 25 Most Common Three-Word Phrases:
('edge', 'core', 'cloud'): 26
('hyperscale', 'cloud', 'providers'): 25
('internal', 'developer', 'platform'): 22
('low', 'code', 'code'): 17
('cloud', 'data', 'services'): 15
('developer', 'platform', 'idp'): 12
('business', 'value', 'engineering'): 11
('gartner', 'hype', 'cycle'): 11
('developer', 'experience', 'devx'): 11
('software', 'supply', 'chain'): 11
('site', 'reliability', 'engineering'): 10
('cloud', 'status', 'dashboards'): 10
('status', 'dashboards', 'cloud'): 10
('oracle', 'cloud', 'infrastructure'): 10
('silo', 'spreadsheet', 'sprawl'): 10
('supply', 'chain', 'security'): 10
('aws', 'azure', 'gcp'): 9
('google', 'cloud', 'platform'): 9
('multicloud', 'data', 'services'): 9
('cloud', 'engineering', 'slo'): 8
('oran', 'vran', '5g'): 8
('cloud', 'impact', 'mapping'): 8
('cloud', 'mean', 'rca'): 8
('hyperscale', 'cloud', 'provider'): 8
('hype', 'cycle', 'emerging'): 8
Top 25 Most Common Words:
cloud: 562
data: 502
ai: 376
platform: 231
google: 229
technology: 212
engineering: 210
software: 208
services: 207
experience: 166
internet: 158
developer: 150
access: 148
business: 139
code: 138
market: 135
computing: 135
company: 132
network: 131
security: 130
product: 130
tools: 123
looking: 123
tech: 123
technologies: 122
Top 25 Most Common Two-Word Phrases:
('platform', 'engineering'): 83
('quantum', 'computing'): 48
('cloud', 'providers'): 43
('machine', 'learning'): 41
('zero', 'trust'): 40
('low', 'code'): 37
('developer', 'experience'): 36
('data', 'centers'): 36
('hyperscale', 'cloud'): 34
('cloud', 'provider'): 31
('google', 'cloud'): 31
('internal', 'developer'): 30
('generative', 'ai'): 29
('data', 'services'): 28
('edge', 'core'): 28
('supply', 'chain'): 27
('core', 'cloud'): 26
('impact', 'mapping'): 26
('status', 'dashboards'): 25
('ai', 'ml'): 24
('mean', 'rca'): 24
('mergers', 'acquisitions'): 24
('artificial', 'intelligence'): 22
('oracle', 'cloud'): 22
('developer', 'platform'): 22
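Side note: pandas gets installed above but never used, so one optional follow-up (my idea, not part of the original script) is to write each run's counts to a file for comparison as the filters evolve:
import pandas as pd

# Persist word counts so successive runs can be compared (hypothetical next step)
pd.DataFrame(word_counts.most_common(), columns=["word", "count"]).to_csv("word_counts.csv", index=False)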
Have I found my niche?
Based on these swirling telecom, cloud, and A.I. themes, I'm leaning towards a newsletter niche that provides a consistent format tracking where the edge and machine learning meet. More to come, but that's my thinking as of this edition.
Stay tuned…
Disclosure
I am linking to my disclosure.