GitHub Follow Botnet Exposed via Jaccard Similarity Analysis

## 8 Accounts, 6 Years Apart, Following Counts Within 25

A developer auditing his 97 GitHub followers noticed a statistical anomaly. Eight accounts—created between 2015 and 2021—each followed roughly 29,835 users. The following counts differed by at most 25. That's a hard pattern to explain organically.

He ran a naive cross-follow check: do these accounts follow each other? The matrix was all zeros. A shallow detector would stop there and clear them. But the real signal wasn't cross-following—it was following-list overlap.
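The naive check can be sketched roughly as follows, assuming the following lists have already been fetched into a dict of sets. The function name `cross_follow_matrix` and the toy logins are illustrative and do not come from the author's script.

```python
from itertools import permutations


def cross_follow_matrix(following_sets):
    """Return {(a, b): 1 or 0}, where 1 means account a follows account b."""
    return {
        (a, b): int(b in following_sets[a])
        for a, b in permutations(following_sets, 2)
    }


# Toy data (hypothetical logins): neither account follows the other,
# yet both follow exactly the same targets.
toy = {
    "bot_one": {"target_1", "target_2", "target_3"},
    "bot_two": {"target_1", "target_2", "target_3"},
}
matrix = cross_follow_matrix(toy)
all_zero = not any(matrix.values())
print(matrix, "all zeros:", all_zero)
```

An all-zero matrix like this one says nothing about whether the accounts share a follow list, which is why the check is easy to evade.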

## The Jaccard Matrix: 0.99+ Similarity Across All Pairs

He pulled the full following lists for all 8 accounts (~29,800 entries each, ~238,000 records total, ~2,400 API requests) and computed pairwise Jaccard similarity. Every pair scored above 0.99. The cluster-level intersection: 29,682 accounts followed by all 8 simultaneously.
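For reference, the standard definition of Jaccard similarity for two following lists $A$ and $B$ is the size of their intersection divided by the size of their union. The toy sets in the second line are illustrative only and are not data from the audit.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

% Toy example: A = \{a, b, c, d\}, \quad B = \{b, c, d, e\}
J(A, B) = \frac{|\{b, c, d\}|}{|\{a, b, c, d, e\}|} = \frac{3}{5} = 0.6
```

A score of 1 means the two lists are identical, and a score of 0 means they share nothing.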

| account_a | account_b | shared | jaccard |
| --- | --- | --- | --- |
| jaderytm | mariwatts | 29,829 | 0.9998 |
| kylehyne | mariwatts | 29,831 | 0.9998 |
| ... | ... | ... | ... |

This pattern is consistent with a shared operator, automation pipeline, or seed list. The accounts appear to avoid cross-following to evade detection, but their near-identical follow lists become the fingerprint.
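To see how overlap acts as a fingerprint without touching the network, here is a small offline sketch on toy data. The logins and targets are hypothetical, and the three-account cluster is only meant to show the shape of the computation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical accounts: two share a full list, one deviates by a single follow.
toy = {
    "bot_one": {"t1", "t2", "t3", "t4"},
    "bot_two": {"t1", "t2", "t3", "t4"},
    "bot_three": {"t1", "t2", "t3", "t5"},
}

# Pairwise similarity for every unordered pair of accounts.
pair_scores = {
    (a, b): jaccard(toy[a], toy[b]) for a, b in combinations(toy, 2)
}

# Targets followed by every account in the cluster.
followed_by_all = set.intersection(*toy.values())

for pair, score in pair_scores.items():
    print(pair, f"{score:.2f}")
print("followed by all:", sorted(followed_by_all))
```

The pairwise scores show which accounts resemble each other, while the cluster-level intersection shows how much of the list every account has in common.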

## Why This Matters: Detecting Sophisticated Botnets

Naive botnet detection checks whether accounts follow each other. Sophisticated operators defeat this by avoiding mutual follows. But they can't easily randomize their follow lists, because the seed list is the product: changing it defeats the purpose.

The method generalizes to any platform that exposes following lists via API. The author provides a detection heuristic:

| Jaccard Range | Interpretation |
| --- | --- |
| < 0.50 | Likely independent |
| 0.50–0.80 | Possible shared source |
| 0.80–0.95 | Suspicious |
| 0.95–0.99 | Coordination likely |
| > 0.99 | Strong coordination signal |
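The heuristic can be turned into a small lookup function. This is a sketch, not the author's code: the name `interpret_jaccard` is made up here, and because the table does not say which band a boundary value belongs to, the code assumes each band includes its lower bound and that exactly 0.99 counts as "Coordination likely".

```python
def interpret_jaccard(j):
    """Map a Jaccard score to the label from the heuristic table.

    Assumption: each band includes its lower bound, and a score of
    exactly 0.99 falls in the "Coordination likely" band.
    """
    if j < 0.50:
        return "Likely independent"
    if j < 0.80:
        return "Possible shared source"
    if j < 0.95:
        return "Suspicious"
    if j <= 0.99:
        return "Coordination likely"
    return "Strong coordination signal"


for score in (0.12, 0.65, 0.9998):
    print(f"{score:.4f} -> {interpret_jaccard(score)}")
```

The thresholds are rules of thumb, so a label should prompt a closer look rather than serve as a verdict.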

## The Code: Python Stdlib Only

The audit script uses only the Python standard library. It fetches following lists via the GitHub API, computes Jaccard similarity for all pairs, and prints the cluster intersection.

import json
import os
import time
import urllib.request
from itertools import combinations

token = os.environ.get("GH_TOKEN")
headers = {
    "Accept": "application/vnd.github.v3+json",
    "User-Agent": "gh-botnet-audit",
}
if token:
    # Only send credentials when a token is set; an unset GH_TOKEN would
    # otherwise produce the invalid header value "token None".
    headers["Authorization"] = f"token {token}"


def get_following(login):
    """Fetch every login that `login` follows, 100 entries per page."""
    following = set()
    page = 1
    while True:
        url = f"https://api.github.com/users/{login}/following?per_page=100&page={page}"
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=20) as r:
            data = json.loads(r.read())
        if not data:
            break
        following.update(u["login"] for u in data)
        if len(data) < 100:  # a short page means this was the last one
            break
        page += 1
        time.sleep(0.1)
    return following


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0 when both sets are empty."""
    intersection = len(a & b)
    union = len(a | b)
    return intersection / union if union else 0


cluster = ["canestein", "hazexone", "domcomit", "kylehyne",
           "jaderytm", "vierystein", "hanyvert", "mariwatts"]
following_sets = {}
for login in cluster:
    following_sets[login] = get_following(login)

for a, b in combinations(cluster, 2):
    shared = len(following_sets[a] & following_sets[b])
    j = jaccard(following_sets[a], following_sets[b])
    print(f"{a:<20} {b:<20} shared={shared} jaccard={j:.4f}")

common = set.intersection(*following_sets.values())
print(f"Followed by ALL accounts: {len(common)}")


**Rate limit warning**: fetching ~29,800 entries per account costs ~300 API calls. GitHub's authenticated limit is 5,000/hour. For large clusters, spread runs across rate-limit windows and increase `time.sleep()` from 0.1 to 0.5 to avoid secondary rate limits.
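The budgeting above can be sketched as two small helpers. They are not part of the author's script: `pages_needed` and `seconds_to_wait` are hypothetical names, and the second one assumes the response headers have been copied into a plain dict with lowercase keys. It reads GitHub's `x-ratelimit-remaining`, `x-ratelimit-reset` (epoch seconds), and `retry-after` response headers.

```python
import time

PER_PAGE = 100  # page size used when listing who a user follows


def pages_needed(following_count, per_page=PER_PAGE):
    """Number of paginated requests needed to read one following list."""
    return (following_count + per_page - 1) // per_page


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next request.

    Assumes `headers` is a plain dict of response headers with
    lowercase keys.
    """
    now = time.time() if now is None else now
    if "retry-after" in headers:  # sent when a secondary limit is hit
        return float(headers["retry-after"])
    if int(headers.get("x-ratelimit-remaining", 1)) > 0:
        return 0.0
    reset_at = int(headers.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)


print(pages_needed(29_800))      # requests for one account
print(pages_needed(29_800) * 8)  # requests for an 8-account cluster
```

Checking the headers after each response lets a long run pause until the window resets instead of failing partway through a cluster.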

## Alternative Explanations

Before calling it a botnet, the author considers other possibilities:
- Could these accounts independently follow the same popular list? Not plausible when every pair scores above 0.99 on lists of ~29,800 follows, across 8 accounts created years apart.
- Could a shared import tool have seeded them? That's still coordination by another name.
- Could one be legitimate? Each of the 8 accounts follows the same 29,682 users as all the others, so no single account looks like a coincidental match.

## Practical Use Cases for This Botnet

The 29,682 common follows are the operator's target list, likely a curated list of GitHub users. Potential uses:
- Engagement laundering: inflating follower counts on accounts used for phishing or spam
- Social proof for repositories seeding malicious packages
- Resale as "established" GitHub accounts

The author has reported the cluster to GitHub with supporting evidence.

## How to Find Similar Clusters in Your Followers

1. Identify candidates: multiple accounts with suspiciously similar following counts (>500 following, with few followers relative to that number) and account ages spread across years.
2. Fetch full following lists for candidates.
3. Compute pairwise Jaccard similarity.
4. Compute cluster-level intersection.
5. Report with Jaccard scores and common following count.
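Step 1 can be sketched as a grouping pass over follower metadata. This is an illustration under stated assumptions, not the author's method: `group_by_following_count` is a hypothetical helper, the tolerance of 25 and the floor of 500 follows mirror figures quoted in this article, and the minimum group size of 3 is an arbitrary choice. The logins and counts are toy data.

```python
def group_by_following_count(counts, tolerance=25, min_following=500, min_group=3):
    """Group logins whose following counts lie within `tolerance` of each other.

    `counts` maps login -> number of accounts that login follows.
    """
    eligible = sorted(
        (n, login) for login, n in counts.items() if n > min_following
    )
    groups, current = [], []
    for n, login in eligible:
        # Close the current group once this count drifts too far
        # from the smallest count in it.
        if current and n - current[0][0] > tolerance:
            if len(current) >= min_group:
                groups.append([name for _, name in current])
            current = []
        current.append((n, login))
    if len(current) >= min_group:
        groups.append([name for _, name in current])
    return groups


# Toy follower metadata (hypothetical logins and counts).
counts = {
    "u1": 29_810, "u2": 29_820, "u3": 29_835,
    "dev_a": 120, "dev_b": 3_100, "dev_c": 29_000,
}
groups = group_by_following_count(counts)
print(groups)
```

Each group returned here is only a candidate cluster; steps 2 to 4 are what confirm or rule out shared follow lists.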

## Tools

All tooling is in the author's BANANA_TREE repository, including `gh_botnet_audit.py` for follower scoring and overlap analysis, and `traffic_report.py` for GitHub + DEV.to analytics. Python stdlib only, with no external dependencies.

The core insight: when accounts avoid cross-following but share an identical seed list, the overlap becomes the fingerprint. The more accounts in a cluster, the stronger the signal—and the harder it is to retroactively randomize without defeating the product.
