Executive Summary #
This report evaluates approaches for extracting content from 218 MIME-encoded .eml newsletter files and organizing them into a structured Markdown content directory. After analyzing the email structure (multipart MIME with quoted-printable encoding, UTF-8 characters, and embedded emojis), I evaluated four primary approaches: Python with the email stdlib plus html2text, Node.js with mailparser plus turndown, Deno with postal-mime, and alternative Python combinations.
Recommendation: Python's standard library email module plus html2text for HTML-to-Markdown conversion. This combination offers the best balance of simplicity (the stdlib handles MIME parsing natively), reliability (quoted-printable and UTF-8 are handled automatically), and ecosystem maturity: html2text sees 556K+ weekly PyPI downloads[1], both libraries have battle-tested documentation, and the stack requires minimal dependencies. The Node.js alternative (mailparser + turndown) is excellent but adds complexity for a batch-processing task that requires neither streaming nor real-time capabilities.
The newsletters follow predictable category patterns (FOD#, Topic #, AI 101, etc.) that can be extracted via regex from subject lines, enabling automatic organization into content subdirectories.
Recommendation: Python email + html2text #
Why This Choice #
Python's standard library email module provides native MIME parsing without external dependencies, while html2text (originally by Aaron Swartz) offers mature, configurable HTML-to-Markdown conversion.[2][3]
Simplicity: The Python email module is part of the standard library, so no installation is required for parsing. The BytesParser with policy.default handles multipart messages, quoted-printable decoding, and charset conversion automatically. Installing html2text is a single command: pip install html2text.
Popularity:
- html2text: ~556K weekly PyPI downloads, 2.1K GitHub stars[1]
- Python email module: Part of stdlib, used in millions of projects
- Compare to Node.js mailparser: ~1.3M weekly npm downloads, 1.6K GitHub stars[4]
Support: Both libraries are actively maintained. html2text had a release in April 2025. Python's email module receives ongoing maintenance as part of the standard library.
Philosophy and Core Concepts #
Python's email library follows the "batteries included" philosophy: MIME parsing should be straightforward without external dependencies. The library provides two interfaces[5]:
- Parser API: Load entire messages from strings, bytes, or files
- FeedParser API: Incremental parsing for streaming scenarios
For batch processing .eml files, the Parser API is ideal. The key insight is using policy.default, which provides modern, UTF-8-aware parsing behavior.
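To make the distinction concrete, here is a minimal sketch of both interfaces reading the same message (the file path is illustrative):

```python
from email import policy
from email.parser import BytesParser, BytesFeedParser

# Parser API: load the complete message in one call (ideal for batch jobs)
with open('emails/example.eml', 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)

# FeedParser API: feed bytes incrementally (useful when streaming)
feed = BytesFeedParser(policy=policy.default)
with open('emails/example.eml', 'rb') as fp:
    for chunk in iter(lambda: fp.read(8192), b''):
        feed.feed(chunk)
msg2 = feed.close()

assert msg['subject'] == msg2['subject']  # both yield equivalent message objects
```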
html2text converts HTML to Markdown by traversing the DOM and applying formatting rules. It aims to produce "clean, easy-to-read plain ASCII text" that is also valid Markdown.[2] The library handles:
- Block elements (headings, paragraphs, lists)
- Inline formatting (bold, italic, links)
- Images (converts to Markdown image syntax or extracts alt text)
- Tables (basic support)
- Unicode/emoji preservation
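As a quick illustration of these rules, here is a small snippet run through a configured converter (the HTML is made up; the output in the comments is approximate):

```python
import html2text

h = html2text.HTML2Text()
h.body_width = 0       # disable hard wrapping
h.unicode_snob = True  # keep Unicode characters such as emoji

html = '<h2>Issue 64</h2><p>A <b>golden age</b> for <a href="https://example.com">indie devs</a> 🚀</p>'
print(h.handle(html))
# ## Issue 64
#
# A **golden age** for [indie devs](https://example.com) 🚀
```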
Getting Started #
Installation #
```bash
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install html2text
pip install html2text

# Optional: Install beautifulsoup4 for HTML pre-cleaning
pip install beautifulsoup4
```
Basic Setup #
```python
# eml_parser.py
from email import policy
from email.parser import BytesParser
import html2text
from pathlib import Path
from datetime import datetime
import re

# Configure html2text converter
def create_converter():
    """Create and configure html2text converter with optimal settings."""
    h = html2text.HTML2Text()
    h.ignore_links = False        # Keep links as Markdown
    h.ignore_images = False       # Keep image references
    h.ignore_emphasis = False     # Keep bold/italic
    h.body_width = 0              # Don't wrap lines (let Markdown handle it)
    h.unicode_snob = True         # Use Unicode characters
    h.skip_internal_links = True  # Skip anchor links
    h.inline_links = True         # Use inline link style [text](url)
    h.protect_links = True        # Don't wrap URLs
    h.ignore_tables = False       # Convert tables
    h.single_line_break = False   # Use double line breaks for paragraphs
    return h
```
Usage Guide #
Basic Usage: Parsing a Single EML File #
```python
from email import policy
from email.parser import BytesParser
from pathlib import Path

def parse_eml(file_path: Path) -> dict:
    """
    Parse an .eml file and extract metadata and content.

    Returns a dict with:
    - subject: Email subject line
    - date: Date header (a DateHeader object under policy.default)
    - sender: From address
    - text_content: Plain text body (if available)
    - html_content: HTML body (if available)
    """
    with open(file_path, 'rb') as fp:
        # Use policy.default for modern, UTF-8-aware parsing
        msg = BytesParser(policy=policy.default).parse(fp)

    result = {
        'subject': msg.get('subject', ''),
        'date': msg.get('date'),  # DateHeader under policy.default; use .datetime for a datetime
        'sender': msg.get('from', ''),
        'text_content': None,
        'html_content': None,
    }

    # Extract body content.
    # get_body() returns the best candidate for the message body;
    # preferencelist determines priority order.

    # Try to get plain text first
    text_part = msg.get_body(preferencelist=('plain',))
    if text_part:
        result['text_content'] = text_part.get_content()

    # Also get HTML for conversion
    html_part = msg.get_body(preferencelist=('html',))
    if html_part:
        result['html_content'] = html_part.get_content()

    return result

# Example usage
email_data = parse_eml(Path('emails/FOD#64_ Golden Age for Indie Devs and Engineers.eml'))
print(f"Subject: {email_data['subject']}")
print(f"Date: {email_data['date']}")
```
Converting HTML to Markdown #
```python
import re

import html2text
from bs4 import BeautifulSoup  # Optional, for pre-cleaning

def html_to_markdown(html_content: str, clean_first: bool = True) -> str:
    """
    Convert HTML email content to clean Markdown.

    Args:
        html_content: Raw HTML string
        clean_first: Whether to pre-clean HTML with BeautifulSoup

    Returns:
        Markdown-formatted string
    """
    if clean_first:
        # Pre-clean HTML to remove email-specific cruft
        soup = BeautifulSoup(html_content, 'html.parser')

        # Remove style tags
        for style in soup.find_all('style'):
            style.decompose()

        # Remove script tags (shouldn't be in emails, but safety first)
        for script in soup.find_all('script'):
            script.decompose()

        # Remove hidden elements
        for hidden in soup.find_all(style=re.compile(r'display:\s*none')):
            hidden.decompose()

        # Remove tracking pixels (1x1 images)
        for img in soup.find_all('img'):
            width = img.get('width', '')
            height = img.get('height', '')
            if width == '1' or height == '1':
                img.decompose()

        html_content = str(soup)

    # Configure html2text
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.ignore_images = False
    converter.body_width = 0  # No line wrapping
    converter.unicode_snob = True
    converter.protect_links = True
    converter.inline_links = True

    markdown = converter.handle(html_content)

    # Post-processing cleanup:
    # remove excessive blank lines
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)

    # Remove trailing whitespace from lines
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))

    return markdown.strip()
```
Extracting Newsletter Categories from Subject Lines #
The Turing Post newsletters use consistent category patterns that can be extracted[6]:
```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class NewsletterCategory:
    """Represents a newsletter category extracted from subject line."""
    category_type: str    # 'FOD', 'Topic', 'AI101', 'Agentic', 'Insights', etc.
    number: Optional[int] # Episode/issue number if applicable
    title: str            # The actual title after the category prefix
    raw_subject: str      # Original subject line

def extract_category(subject: str) -> NewsletterCategory:
    """
    Extract category information from Turing Post newsletter subject lines.

    Patterns recognized:
    - FOD#64: ...       -> FOD (Findings of the Day) series
    - Topic 4: ...      -> Topic series (technical deep-dives)
    - AI 101: ...       -> AI 101 series (educational)
    - 🦸#5: ...         -> Superhero/Agentic series (superhero emoji prefix)
    - 🌉#77: ...        -> San Francisco/Insights series (bridge emoji prefix)
    - 🎙 Interview: ... -> Podcast/interview episodes (microphone emoji prefix)
    - Concepts: ...     -> Concepts series
    - Guest post: ...   -> Guest contributions
    - Webinar: ...      -> Webinar announcements
    """
    # Check for specific emoji prefixes first
    if subject.startswith('\U0001F9B8'):  # Superhero emoji
        match = re.match(r'^.+#(\d+):\s*(.+)$', subject)
        if match:
            return NewsletterCategory('Agentic', int(match.group(1)), match.group(2).strip(), subject)

    if subject.startswith('\U0001F309'):  # Bridge emoji (San Francisco)
        match = re.match(r'^.+#(\d+):\s*(.+)$', subject)
        if match:
            return NewsletterCategory('Insights', int(match.group(1)), match.group(2).strip(), subject)

    if '\U0001F399' in subject[:5]:  # Microphone emoji
        title = re.sub(r'^[\U0001F399\uFE0F\s]+', '', subject).strip()
        return NewsletterCategory('Podcast', None, title, subject)

    # FOD pattern - e.g., "FOD#64: Golden Age..."
    match = re.match(r'^FOD#(\d+):\s*(.+)$', subject)
    if match:
        return NewsletterCategory('FOD', int(match.group(1)), match.group(2).strip(), subject)

    # Topic pattern - e.g., "Topic 4: What is FSDP..."
    match = re.match(r'^Topic\s+(\d+):\s*(.+)$', subject)
    if match:
        return NewsletterCategory('Topic', int(match.group(1)), match.group(2).strip(), subject)

    # AI 101 pattern - e.g., "AI 101: What is Continual Learning?"
    match = re.match(r'^AI\s*101:\s*(.+)$', subject)
    if match:
        return NewsletterCategory('AI101', None, match.group(1).strip(), subject)

    # Concepts pattern - e.g., "Concepts: Types of Deep Learning"
    match = re.match(r'^Concepts:\s*(.+)$', subject)
    if match:
        return NewsletterCategory('Concepts', None, match.group(1).strip(), subject)

    # Guest post pattern - e.g., "Guest post: Why AI Databases..."
    match = re.match(r'^Guest\s+post:\s*(.+)$', subject, re.IGNORECASE)
    if match:
        return NewsletterCategory('GuestPost', None, match.group(1).strip(), subject)

    # Webinar pattern
    match = re.match(r'^\[?[Ww]ebinar\]?[:\s]+(.+)$', subject)
    if match:
        return NewsletterCategory('Webinar', None, match.group(1).strip(), subject)

    # Default: uncategorized
    return NewsletterCategory('Uncategorized', None, subject.strip(), subject)
```
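Before running the extractor over all 218 files, it is worth a quick smoke test. The subject lines below appear earlier in this report, except the emoji example, which is a hypothetical stand-in:

```python
samples = [
    ('FOD#64: Golden Age for Indie Devs and Engineers', 'FOD', 64),
    ('Topic 4: What is FSDP and YaFSDP?', 'Topic', 4),
    ('AI 101: What is Continual Learning?', 'AI101', None),
    ('\U0001F9B8#5: Example agentic title', 'Agentic', 5),  # hypothetical subject
]
for subject, expected_type, expected_number in samples:
    cat = extract_category(subject)
    assert cat.category_type == expected_type, (subject, cat)
    assert cat.number == expected_number, (subject, cat)
print('All sample subjects categorized as expected')
```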
Complete Extraction Pipeline #
```python
#!/usr/bin/env python3
"""
Complete pipeline for extracting Turing Post newsletters from .eml files
to an organized Markdown content directory.
"""

from email import policy
from email.parser import BytesParser
from pathlib import Path
from datetime import datetime
import html2text
import re
from typing import Optional
from dataclasses import dataclass
import unicodedata

@dataclass
class NewsletterCategory:
    category_type: str
    number: Optional[int]
    title: str
    raw_subject: str

def slugify(text: str) -> str:
    """Convert text to URL-friendly slug."""
    # Normalize unicode
    text = unicodedata.normalize('NFKD', text)
    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Convert to lowercase
    text = text.lower()
    # Replace spaces and special chars with hyphens
    text = re.sub(r'[^\w\s-]', '', text)
    text = re.sub(r'[-\s]+', '-', text)
    return text.strip('-')[:50]  # Limit length

def extract_category(subject: str) -> NewsletterCategory:
    """Extract category from subject line (see previous code block).
    Simplified version for brevity."""
    if 'FOD#' in subject:
        match = re.search(r'FOD#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('FOD', int(match.group(1)), match.group(2), subject)

    if subject.startswith('Topic'):
        match = re.match(r'Topic\s+(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Topic', int(match.group(1)), match.group(2), subject)

    if 'AI 101' in subject or 'AI101' in subject:
        title = re.sub(r'^AI\s*101[:\s]*', '', subject)
        return NewsletterCategory('AI101', None, title, subject)

    # Check for emoji-prefixed series
    if '\U0001F9B8' in subject:  # Superhero
        match = re.search(r'#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Agentic', int(match.group(1)), match.group(2), subject)

    if '\U0001F309' in subject:  # Bridge
        match = re.search(r'#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Insights', int(match.group(1)), match.group(2), subject)

    if '\U0001F399' in subject:  # Microphone
        title = re.sub(r'[\U0001F399\uFE0F\s]+', '', subject).strip()
        return NewsletterCategory('Podcast', None, title, subject)

    if 'Concepts:' in subject:
        title = subject.split('Concepts:', 1)[1].strip()
        return NewsletterCategory('Concepts', None, title, subject)

    # Note: compare lowercase to lowercase, otherwise this never matches
    if 'guest post' in subject.lower():
        title = re.sub(r'^Guest\s+post:\s*', '', subject, flags=re.IGNORECASE)
        return NewsletterCategory('GuestPost', None, title, subject)

    return NewsletterCategory('Uncategorized', None, subject, subject)

def create_markdown_converter():
    """Create configured html2text converter."""
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = False
    h.body_width = 0
    h.unicode_snob = True
    h.protect_links = True
    h.inline_links = True
    h.skip_internal_links = True
    return h

def parse_eml_file(file_path: Path) -> dict:
    """Parse .eml file and extract all content."""
    with open(file_path, 'rb') as fp:
        msg = BytesParser(policy=policy.default).parse(fp)

    # Extract date (a DateHeader under policy.default)
    date_header = msg.get('date')
    if hasattr(date_header, 'datetime'):
        date = date_header.datetime
    else:
        date = datetime.now()

    # Get HTML content (preferred) or plain text
    html_part = msg.get_body(preferencelist=('html',))
    text_part = msg.get_body(preferencelist=('plain',))

    html_content = html_part.get_content() if html_part else None
    text_content = text_part.get_content() if text_part else None

    return {
        'subject': msg.get('subject', 'No Subject'),
        'date': date,
        'sender': msg.get('from', ''),
        'html_content': html_content,
        'text_content': text_content,
    }

def html_to_markdown(html: str) -> str:
    """Convert HTML to clean Markdown."""
    converter = create_markdown_converter()
    markdown = converter.handle(html)

    # Clean up excessive whitespace
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))

    return markdown.strip()

def create_frontmatter(email_data: dict, category: NewsletterCategory) -> str:
    """Create YAML frontmatter for Markdown file."""
    date_str = email_data['date'].strftime('%Y-%m-%d') if email_data['date'] else 'unknown'
    # Escape quotes outside the f-string (backslashes in f-string
    # expressions are a SyntaxError before Python 3.12)
    safe_title = category.title.replace('"', '\\"')
    safe_subject = email_data['subject'].replace('"', '\\"')

    frontmatter = f"""---
title: "{safe_title}"
date: {date_str}
category: {category.category_type}
"""
    if category.number is not None:
        frontmatter += f"episode: {category.number}\n"

    frontmatter += f"""source: Turing Post Newsletter
original_subject: "{safe_subject}"
---

"""
    return frontmatter

def process_newsletter(eml_path: Path, output_dir: Path) -> Path:
    """
    Process a single newsletter .eml file and save as Markdown.

    Returns the path to the created Markdown file.
    """
    # Parse email
    email_data = parse_eml_file(eml_path)

    # Extract category
    category = extract_category(email_data['subject'])

    # Create category subdirectory
    category_dir = output_dir / category.category_type.lower()
    category_dir.mkdir(parents=True, exist_ok=True)

    # Generate filename
    date_str = email_data['date'].strftime('%Y-%m-%d') if email_data['date'] else 'unknown'
    title_slug = slugify(category.title)

    if category.number is not None:
        filename = f"{date_str}-{category.category_type.lower()}-{category.number:03d}-{title_slug}.md"
    else:
        filename = f"{date_str}-{title_slug}.md"

    output_path = category_dir / filename

    # Convert content to Markdown
    if email_data['html_content']:
        content = html_to_markdown(email_data['html_content'])
    elif email_data['text_content']:
        content = email_data['text_content']
    else:
        content = "(No content extracted)"

    # Build final document
    frontmatter = create_frontmatter(email_data, category)
    full_document = frontmatter + f"# {category.title}\n\n" + content

    # Write file
    output_path.write_text(full_document, encoding='utf-8')

    return output_path

def process_all_newsletters(emails_dir: Path, output_dir: Path) -> dict:
    """
    Process all .eml files in directory.

    Returns summary statistics.
    """
    eml_files = list(emails_dir.glob('*.eml'))

    stats = {
        'total': len(eml_files),
        'processed': 0,
        'errors': [],
        'categories': {}
    }

    for eml_path in eml_files:
        try:
            output_path = process_newsletter(eml_path, output_dir)
            stats['processed'] += 1

            # Track category counts
            category = output_path.parent.name
            stats['categories'][category] = stats['categories'].get(category, 0) + 1

            print(f"Processed: {eml_path.name} -> {output_path}")

        except Exception as e:
            stats['errors'].append((eml_path.name, str(e)))
            print(f"Error processing {eml_path.name}: {e}")

    return stats

# Main execution
if __name__ == '__main__':
    import sys

    emails_dir = Path('emails')
    output_dir = Path('content')

    if not emails_dir.exists():
        print(f"Error: {emails_dir} directory not found")
        sys.exit(1)

    output_dir.mkdir(exist_ok=True)

    print(f"Processing newsletters from {emails_dir} to {output_dir}")
    print("-" * 60)

    stats = process_all_newsletters(emails_dir, output_dir)

    print("-" * 60)
    print(f"Processed {stats['processed']}/{stats['total']} files")
    print("\nCategories:")
    for cat, count in sorted(stats['categories'].items()):
        print(f"  {cat}: {count}")

    if stats['errors']:
        print(f"\nErrors ({len(stats['errors'])}):")
        for filename, error in stats['errors']:
            print(f"  {filename}: {error}")
```
Advanced Usage #
Handling Complex HTML Newsletters #
For newsletters with complex HTML (tables, nested divs, email-specific markup), consider pre-cleaning with BeautifulSoup[7]:
```python
from bs4 import BeautifulSoup
import re

def clean_newsletter_html(html: str) -> str:
    """
    Pre-clean newsletter HTML before Markdown conversion.
    Removes email cruft while preserving content structure.
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove elements that don't convert well
    for tag in soup.find_all(['style', 'script', 'head', 'meta', 'link']):
        tag.decompose()

    # Remove hidden elements
    for el in soup.find_all(style=re.compile(r'display:\s*none', re.I)):
        el.decompose()

    # Remove tracking pixels and spacer images
    for img in soup.find_all('img'):
        src = img.get('src', '')
        width = img.get('width', '')
        height = img.get('height', '')
        alt = img.get('alt', '')

        # Remove 1x1 tracking pixels
        if width == '1' or height == '1':
            img.decompose()
            continue

        # Remove spacer GIFs
        if 'spacer' in src.lower() or 'blank' in src.lower():
            img.decompose()
            continue

        # Keep images with meaningful alt text or content
        if not alt and not src:
            img.decompose()

    # Remove empty paragraphs
    for p in soup.find_all('p'):
        if not p.get_text(strip=True) and not p.find('img'):
            p.decompose()

    # Remove newsletter footer boilerplate (common patterns)
    footer_patterns = [
        'unsubscribe',
        'manage your preferences',
        'update subscription',
        'you are receiving this',
        'this email was sent to',
        'view in browser',
    ]

    for pattern in footer_patterns:
        for el in soup.find_all(string=re.compile(pattern, re.I)):
            # Find parent container and remove it
            parent = el.find_parent(['div', 'td', 'tr', 'table', 'p'])
            if parent:
                parent.decompose()

    # Simplify nested tables (common in email templates):
    # replace single-cell tables with their content
    for table in soup.find_all('table'):
        cells = table.find_all(['td', 'th'])
        if len(cells) == 1:
            table.replace_with(cells[0])

    return str(soup)

def enhanced_html_to_markdown(html: str) -> str:
    """Convert HTML to Markdown with pre-cleaning."""
    cleaned = clean_newsletter_html(html)
    return html_to_markdown(cleaned)
```
Extracting and Preserving Images #
```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def extract_images(html: str, output_dir: Path) -> dict:
    """
    Extract image URLs from HTML and optionally download them.
    Returns a mapping of original URLs to local paths.
    """
    soup = BeautifulSoup(html, 'html.parser')
    image_map = {}

    images_dir = output_dir / 'images'
    images_dir.mkdir(parents=True, exist_ok=True)

    for img in soup.find_all('img'):
        src = img.get('src', '')
        if not src or src.startswith('data:'):
            continue

        # Generate local filename from URL hash
        url_hash = hashlib.md5(src.encode()).hexdigest()[:12]
        ext = Path(urlparse(src).path).suffix or '.png'
        local_name = f"{url_hash}{ext}"
        local_path = images_dir / local_name

        # Download if not already present
        if not local_path.exists():
            try:
                response = requests.get(src, timeout=10)
                response.raise_for_status()
                local_path.write_bytes(response.content)
            except Exception as e:
                print(f"Failed to download {src}: {e}")
                continue

        image_map[src] = f"images/{local_name}"

    return image_map

def replace_image_urls(markdown: str, image_map: dict) -> str:
    """Replace remote image URLs with local paths in Markdown."""
    for remote_url, local_path in image_map.items():
        markdown = markdown.replace(remote_url, local_path)
    return markdown
```
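Tying the two helpers into the earlier pipeline for a single newsletter might look like this (a sketch; html_content and the output directory are assumptions carried over from the previous examples):

```python
# Download remote images, then point the Markdown at the local copies
output_dir = Path('content/fod')
image_map = extract_images(html_content, output_dir)
markdown = replace_image_urls(html_to_markdown(html_content), image_map)
(output_dir / 'newsletter.md').write_text(markdown, encoding='utf-8')
```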
Alternatives Considered #
| Library/Stack | Simplicity | Popularity | Support | Why Not Chosen |
|---|---|---|---|---|
| Node.js mailparser + turndown | Medium | mailparser: 1.3M/week, turndown: 2M/week[4][8] | Active | Adds Node.js runtime dependency; overkill for batch processing; better for streaming/real-time |
| Python markdownify | High | ~100K/week | Active | Less control over output format; html2text more battle-tested |
| Deno + postal-mime | Medium | postal-mime: ~65K/week[9] | Active | Ecosystem less mature; postal-mime designed for browser/serverless, not batch |
| Python eml_parser | Medium | Lower adoption | Limited | Focuses on forensics/metadata; html2text better for content extraction |
Node.js Alternative: When to Use It #
If you need streaming processing, real-time email handling, or are already in a Node.js ecosystem, the mailparser + turndown combination is excellent[4][8]:
```javascript
// Node.js alternative implementation
const { simpleParser } = require('mailparser');
const TurndownService = require('turndown');
const fs = require('fs').promises;

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-'
});

async function parseEml(filePath) {
  const emlContent = await fs.readFile(filePath);
  const parsed = await simpleParser(emlContent);

  return {
    subject: parsed.subject,
    date: parsed.date,
    from: parsed.from?.text,
    html: parsed.html,
    text: parsed.text
  };
}

function htmlToMarkdown(html) {
  return turndown.turndown(html);
}

// Usage (top-level await is unavailable in CommonJS, so wrap in an async IIFE)
(async () => {
  const email = await parseEml('emails/newsletter.eml');
  const markdown = htmlToMarkdown(email.html);
  console.log(markdown);
})();
```
Caveats and Limitations #
When This Is NOT the Right Choice #
- Streaming/Real-time Processing: Python's email module loads entire messages into memory. For processing email streams or very large volumes in real time, Node.js mailparser with streaming is better.
- Browser/Serverless Environment: Python is not suitable for browser environments. Use PostalMime for Cloudflare Workers, browser extensions, or front-end applications.[9]
- Preserving Exact HTML Structure: html2text is lossy: it produces Markdown, not round-trippable HTML. If you need to preserve exact HTML fidelity, use a different approach.
- Complex Table Layouts: html2text has basic table support but struggles with the complex nested tables common in email templates. These may require pre-processing to simplify.
Known Limitations #
- html2text GPL License: html2text is distributed under GPLv3. Check compatibility with your licensing requirements.[2]
- Image Handling: Images are converted to Markdown syntax but not downloaded by default. Implement separate image extraction if needed.
- Email Template Cruft: Newsletter HTML often contains significant boilerplate (tracking pixels, MSO conditionals, nested tables). Pre-cleaning with BeautifulSoup is recommended.
- Emoji in Subject Lines: The category extraction regex needs careful handling of Unicode emoji characters. Test thoroughly against your actual data.
- Date Parsing Edge Cases: Some emails may have malformed date headers. Implement a fallback to file modification time, or extract the date from the filename (see the sketch below).
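For the date edge case, one possible fallback is sketched below; resolve_date is a hypothetical helper (not part of the pipeline above) that assumes file modification time is an acceptable proxy when the header is unusable:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path

def resolve_date(msg, eml_path: Path) -> datetime:
    """Best-effort date: Date header first, file mtime as a last resort."""
    raw = msg.get('date')
    if raw is not None:
        # Under policy.default the header exposes .datetime directly
        dt = getattr(raw, 'datetime', None)
        if dt is not None:
            return dt
        try:
            return parsedate_to_datetime(str(raw))
        except (TypeError, ValueError):
            pass
    # Fallback assumption: file modification time approximates send date
    return datetime.fromtimestamp(eml_path.stat().st_mtime, tz=timezone.utc)
```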
Content Directory Structure #
The recommended output structure for 218 newsletters:
content/
├── fod/ # Findings of the Day series
│ ├── 2024-08-26-fod-064-golden-age-for-indie-devs.md
│ └── ...
├── topic/ # Topic deep-dives
│ ├── 2024-09-15-topic-004-what-is-fsdp-and-yafsdp.md
│ └── ...
├── ai101/ # AI 101 educational series
│ └── ...
├── agentic/ # Superhero/Agentic series
│ └── ...
├── insights/ # San Francisco/Insights series
│ └── ...
├── podcast/ # Interview/podcast episodes
│ └── ...
├── concepts/ # Concepts series
│ └── ...
├── guestpost/ # Guest contributions
│ └── ...
├── webinar/ # Webinar announcements
│ └── ...
└── uncategorized/ # Emails not matching patterns
└── ...
Bibliography #
Additional Resources #
- Official Python email Documentation
- html2text Options Reference
- BeautifulSoup Documentation
- Turndown GitHub Repository
- RFC 822 - Email Message Format
References #
1. html2text on PyPI - Python package for HTML-to-Markdown conversion, ~556K weekly downloads
2. GitHub - Alir3z4/html2text - Official repository, 2.1K stars, GPLv3 license
3. Python email.parser documentation - Official Python standard library documentation
4. mailparser on npm - Node.js email parser, ~1.3M weekly downloads, 1.6K GitHub stars
5. Stack Overflow: Parsing EML files - Community discussion on EML parsing approaches
6. Sample analysis of Turing Post newsletter subject-line patterns from the provided .eml files
7. Converting HTML to Markdown with Python - Comprehensive Guide - Best practices for HTML pre-cleaning
8. Turndown on npm - JavaScript HTML-to-Markdown converter, ~2M weekly downloads, 10.4K GitHub stars
9. postal-mime on npm - Browser/serverless email parser, ~65K weekly downloads