Extracting Content from EML Files to Structured Markdown

combray's blog


title: Extracting Content from EML Files to Structured Markdown
date: 2025-11-30
topic: EML email parsing and HTML-to-Markdown conversion
recommendation: Python with email stdlib + html2text
project-context: 218 .eml newsletter files requiring batch extraction

When to Use #

  - Batch processing of .eml files from newsletter archives
  - Converting MIME-encoded emails to readable Markdown
  - Extracting metadata (date, subject, sender) alongside content
  - Creating structured content directories from email archives
  - Processing newsletters with mixed HTML/plain text content

When NOT to Use #

  - Real-time email processing requiring streaming (use Node.js mailparser)
  - Browser-based email parsing (use PostalMime)
  - When you need to preserve exact HTML fidelity (use markdownify with custom rules)
  - Processing extremely large attachments (100MB+) where memory is a concern

Executive Summary #

This report evaluates approaches for extracting content from 218 MIME-encoded .eml newsletter files and organizing them into a structured Markdown content directory. After analyzing the email structure (multipart MIME with quoted-printable encoding, UTF-8 characters, and embedded emojis), I evaluated four primary approaches: Python with the email stdlib plus html2text, Node.js with mailparser plus turndown, Deno with postal-mime, and alternative Python combinations.

Recommendation: Python with the standard library email module plus html2text for HTML-to-Markdown conversion. This combination offers the best balance of simplicity (stdlib handles MIME parsing natively), reliability (handles quoted-printable and UTF-8 automatically), and ecosystem maturity. html2text sees 556K+ weekly PyPI downloads [1], both libraries have battle-tested documentation, and the stack requires minimal dependencies. The Node.js alternative (mailparser + turndown) is excellent but adds complexity for a batch processing task that does not require streaming or real-time capabilities.

The newsletters follow predictable category patterns (FOD#, Topic #, AI 101, etc.) that can be extracted via regex from subject lines, enabling automatic organization into content subdirectories.

Recommendation: Python email + html2text #

Why This Choice #

Python's standard library email module provides native MIME parsing without external dependencies, while html2text (originally by Aaron Swartz) offers mature, configurable HTML-to-Markdown conversion. [2][3]

Simplicity: The Python email module is part of the standard library - no installation required for parsing. The BytesParser with policy.default handles multipart messages, quoted-printable decoding, and charset conversion automatically. html2text installation is a single pip install html2text.

Popularity: html2text sees roughly 556K weekly downloads on PyPI [1]; Python's email module ships with the standard library, so its reach is effectively every Python installation.

Support: Both libraries are actively maintained. html2text had a release in April 2025. Python's email module receives ongoing maintenance as part of the standard library.

Philosophy and Core Concepts #

Python's email library follows the "batteries included" philosophy - MIME parsing should be straightforward without external dependencies. The library provides two interfaces: [5]

  1. Parser API: Load entire messages from strings, bytes, or files
  2. FeedParser API: Incremental parsing for streaming scenarios

For batch processing .eml files, the Parser API is ideal. The key insight is using policy.default which provides modern, UTF-8-aware parsing behavior.
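The two interfaces differ only in how the bytes arrive. A minimal FeedParser sketch (the message bytes here are made up for illustration):

```python
from email import policy
from email.feedparser import BytesFeedParser

# FeedParser API: feed bytes incrementally (useful for streaming sources)
parser = BytesFeedParser(policy=policy.default)
parser.feed(b'Subject: Hello\r\n')
parser.feed(b'\r\nBody text\r\n')
msg = parser.close()
print(msg['subject'])
```

For whole files on disk, the Parser API (`BytesParser.parse`, shown throughout this report) is simpler and equivalent.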

html2text converts HTML to Markdown by traversing the DOM and applying formatting rules. It aims to produce "clean, easy-to-read plain ASCII text" that is also valid Markdown. [2] The library handles:

  - Inline links and images
  - Emphasis (bold/italic) and headings
  - Tables
  - Unicode characters (via unicode_snob)
  - Configurable line wrapping (via body_width)

Getting Started #

Installation #

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install html2text
pip install html2text

# Optional: Install beautifulsoup4 for HTML pre-cleaning
pip install beautifulsoup4

Basic Setup #

# eml_parser.py
from email import policy
from email.parser import BytesParser
import html2text
from pathlib import Path
from datetime import datetime
import re

# Configure html2text converter
def create_converter():
    """Create and configure html2text converter with optimal settings."""
    h = html2text.HTML2Text()
    h.ignore_links = False          # Keep links as Markdown
    h.ignore_images = False         # Keep image references
    h.ignore_emphasis = False       # Keep bold/italic
    h.body_width = 0                # Don't wrap lines (let Markdown handle it)
    h.unicode_snob = True           # Use Unicode characters
    h.skip_internal_links = True    # Skip anchor links
    h.inline_links = True           # Use inline link style [text](url)
    h.protect_links = True          # Don't wrap URLs
    h.ignore_tables = False         # Convert tables
    h.single_line_break = False     # Use double line breaks for paragraphs
    return h

Usage Guide #

Basic Usage: Parsing a Single EML File #

from email import policy
from email.parser import BytesParser
from pathlib import Path

def parse_eml(file_path: Path) -> dict:
    """
    Parse an .eml file and extract metadata and content.

    Returns a dict with:
    - subject: Email subject line
    - date: Date header (exposes .datetime with policy.default)
    - sender: From address
    - text_content: Plain text body (if available)
    - html_content: HTML body (if available)
    """
    with open(file_path, 'rb') as fp:
        # Use policy.default for modern, UTF-8-aware parsing
        msg = BytesParser(policy=policy.default).parse(fp)

    result = {
        'subject': msg.get('subject', ''),
        'date': msg.get('date'),  # Header object; .datetime gives a datetime with policy.default
        'sender': msg.get('from', ''),
        'text_content': None,
        'html_content': None,
    }

    # Extract body content
    # get_body() returns the best candidate for the message body
    # preferencelist determines priority order

    # Try to get plain text first
    text_part = msg.get_body(preferencelist=('plain',))
    if text_part:
        result['text_content'] = text_part.get_content()

    # Also get HTML for conversion
    html_part = msg.get_body(preferencelist=('html',))
    if html_part:
        result['html_content'] = html_part.get_content()

    return result

# Example usage
email_data = parse_eml(Path('emails/FOD#64_ Golden Age for Indie Devs and Engineers.eml'))
print(f"Subject: {email_data['subject']}")
print(f"Date: {email_data['date']}")

Converting HTML to Markdown #

import re

import html2text
from bs4 import BeautifulSoup  # Optional, for pre-cleaning

def html_to_markdown(html_content: str, clean_first: bool = True) -> str:
    """
    Convert HTML email content to clean Markdown.

    Args:
        html_content: Raw HTML string
        clean_first: Whether to pre-clean HTML with BeautifulSoup

    Returns:
        Markdown-formatted string
    """
    if clean_first:
        # Pre-clean HTML to remove email-specific cruft
        soup = BeautifulSoup(html_content, 'html.parser')

        # Remove style tags
        for style in soup.find_all('style'):
            style.decompose()

        # Remove script tags (shouldn't be in emails, but safety first)
        for script in soup.find_all('script'):
            script.decompose()

        # Remove hidden elements
        for hidden in soup.find_all(style=re.compile(r'display:\s*none')):
            hidden.decompose()

        # Remove tracking pixels (1x1 images)
        for img in soup.find_all('img'):
            width = img.get('width', '')
            height = img.get('height', '')
            if width == '1' or height == '1':
                img.decompose()

        html_content = str(soup)

    # Configure html2text
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.ignore_images = False
    converter.body_width = 0  # No line wrapping
    converter.unicode_snob = True
    converter.protect_links = True
    converter.inline_links = True

    markdown = converter.handle(html_content)

    # Post-processing cleanup
    # Remove excessive blank lines
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)

    # Remove trailing whitespace from lines
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))

    return markdown.strip()

Extracting Newsletter Categories from Subject Lines #

The Turing Post newsletters use consistent category patterns that can be extracted: [6]

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class NewsletterCategory:
    """Represents a newsletter category extracted from subject line."""
    category_type: str      # 'FOD', 'Topic', 'AI101', 'Superhero', 'SanFran', etc.
    number: Optional[int]   # Episode/issue number if applicable
    title: str              # The actual title after the category prefix
    raw_subject: str        # Original subject line

def extract_category(subject: str) -> NewsletterCategory:
    """
    Extract category information from Turing Post newsletter subject lines.

    Patterns recognized:
    - FOD#64: ... -> FOD (Findings of the Day) series
    - Topic 4: ... -> Topic series (technical deep-dives)
    - AI 101: ... -> AI 101 series (educational)
    - Series#5: ... -> Superhero/Agentic series (with superhero emoji)
    - Series#77: ... -> San Francisco series (with bridge emoji)
    - Interview: ... -> Podcast/interview episodes (with microphone emoji)
    - Guest post: ... -> Guest contributions
    - Webinar: ... -> Webinar announcements
    """
    # Check for specific emoji prefixes first
    if subject.startswith('\U0001F9B8'):  # Superhero emoji
        match = re.match(r'^.+#(\d+):\s*(.+)$', subject)
        if match:
            return NewsletterCategory('Agentic', int(match.group(1)), match.group(2).strip(), subject)

    if subject.startswith('\U0001F309'):  # Bridge emoji (San Francisco)
        match = re.match(r'^.+#(\d+):\s*(.+)$', subject)
        if match:
            return NewsletterCategory('Insights', int(match.group(1)), match.group(2).strip(), subject)

    if '\U0001F399' in subject[:5]:  # Microphone emoji (possibly with variation selector)
        title = re.sub(r'^[\U0001F399\uFE0F\s]+', '', subject).strip()
        return NewsletterCategory('Podcast', None, title, subject)

    # Try standard patterns
    # FOD (Findings of the Day) - e.g., "FOD#64: Golden Age..."
    match = re.match(r'^FOD#(\d+):\s*(.+)$', subject)
    if match:
        return NewsletterCategory('FOD', int(match.group(1)), match.group(2).strip(), subject)

    # Topic series - e.g., "Topic 4: What is FSDP..."
    match = re.match(r'^Topic\s+(\d+):\s*(.+)$', subject)
    if match:
        return NewsletterCategory('Topic', int(match.group(1)), match.group(2).strip(), subject)

    # AI 101 series - e.g., "AI 101: What is Continual Learning?"
    match = re.match(r'^AI\s*101:\s*(.+)$', subject)
    if match:
        return NewsletterCategory('AI101', None, match.group(1).strip(), subject)

    # Concepts series - e.g., "Concepts: Types of Deep Learning"
    match = re.match(r'^Concepts:\s*(.+)$', subject)
    if match:
        return NewsletterCategory('Concepts', None, match.group(1).strip(), subject)

    # Guest posts - e.g., "Guest post: Why AI Databases..."
    match = re.match(r'^Guest\s+[Pp]ost:\s*(.+)$', subject, re.IGNORECASE)
    if match:
        return NewsletterCategory('GuestPost', None, match.group(1).strip(), subject)

    # Webinars
    match = re.match(r'^\[?[Ww]ebinar\]?[:\s]+(.+)$', subject)
    if match:
        return NewsletterCategory('Webinar', None, match.group(1).strip(), subject)

    # Default: uncategorized
    return NewsletterCategory('Uncategorized', None, subject.strip(), subject)

Complete Extraction Pipeline #

#!/usr/bin/env python3
"""
Complete pipeline for extracting Turing Post newsletters from .eml files
to organized Markdown content directory.
"""

from email import policy
from email.parser import BytesParser
from pathlib import Path
from datetime import datetime
import html2text
import re
from typing import Optional
from dataclasses import dataclass
import unicodedata

@dataclass
class NewsletterCategory:
    category_type: str
    number: Optional[int]
    title: str
    raw_subject: str

def slugify(text: str) -> str:
    """Convert text to URL-friendly slug."""
    # Normalize unicode
    text = unicodedata.normalize('NFKD', text)
    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Convert to lowercase
    text = text.lower()
    # Replace spaces and special chars with hyphens
    text = re.sub(r'[^\w\s-]', '', text)
    text = re.sub(r'[-\s]+', '-', text)
    return text.strip('-')[:50]  # Limit length

def extract_category(subject: str) -> NewsletterCategory:
    """Extract category from subject line (see previous code block)."""
    # ... (implementation from above)
    # Simplified version for brevity:

    if 'FOD#' in subject:
        match = re.search(r'FOD#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('FOD', int(match.group(1)), match.group(2), subject)

    if subject.startswith('Topic'):
        match = re.match(r'Topic\s+(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Topic', int(match.group(1)), match.group(2), subject)

    if 'AI 101' in subject or 'AI101' in subject:
        title = re.sub(r'^AI\s*101[:\s]*', '', subject)
        return NewsletterCategory('AI101', None, title, subject)

    # Check for emoji-prefixed series
    if '\U0001F9B8' in subject:  # Superhero
        match = re.search(r'#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Agentic', int(match.group(1)), match.group(2), subject)

    if '\U0001F309' in subject:  # Bridge
        match = re.search(r'#(\d+):\s*(.+)', subject)
        if match:
            return NewsletterCategory('Insights', int(match.group(1)), match.group(2), subject)

    if '\U0001F399' in subject:  # Microphone
        title = re.sub(r'^[\U0001F399\uFE0F\s]+', '', subject).strip()
        return NewsletterCategory('Podcast', None, title, subject)

    if 'Concepts:' in subject:
        title = subject.split('Concepts:', 1)[1].strip()
        return NewsletterCategory('Concepts', None, title, subject)

    if 'guest post' in subject.lower():
        title = re.sub(r'^Guest\s+post:\s*', '', subject, flags=re.IGNORECASE)
        return NewsletterCategory('GuestPost', None, title, subject)

    return NewsletterCategory('Uncategorized', None, subject, subject)

def create_markdown_converter():
    """Create configured html2text converter."""
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = False
    h.body_width = 0
    h.unicode_snob = True
    h.protect_links = True
    h.inline_links = True
    h.skip_internal_links = True
    return h

def parse_eml_file(file_path: Path) -> dict:
    """Parse .eml file and extract all content."""
    with open(file_path, 'rb') as fp:
        msg = BytesParser(policy=policy.default).parse(fp)

    # Extract date
    date_header = msg.get('date')
    if hasattr(date_header, 'datetime'):
        date = date_header.datetime
    else:
        date = datetime.now()

    # Get HTML content (preferred) or plain text
    html_part = msg.get_body(preferencelist=('html',))
    text_part = msg.get_body(preferencelist=('plain',))

    html_content = html_part.get_content() if html_part else None
    text_content = text_part.get_content() if text_part else None

    return {
        'subject': msg.get('subject', 'No Subject'),
        'date': date,
        'sender': msg.get('from', ''),
        'html_content': html_content,
        'text_content': text_content,
    }

def html_to_markdown(html: str) -> str:
    """Convert HTML to clean Markdown."""
    converter = create_markdown_converter()
    markdown = converter.handle(html)

    # Clean up excessive whitespace
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))

    return markdown.strip()

def create_frontmatter(email_data: dict, category: NewsletterCategory) -> str:
    """Create YAML frontmatter for Markdown file."""
    date_str = email_data['date'].strftime('%Y-%m-%d') if email_data['date'] else 'unknown'
    # Escape quotes outside the f-strings (backslashes inside f-string
    # expressions are a SyntaxError before Python 3.12)
    safe_title = category.title.replace('"', '\\"')
    safe_subject = email_data['subject'].replace('"', '\\"')

    frontmatter = f"""---
title: "{safe_title}"
date: {date_str}
category: {category.category_type}
"""
    if category.number:
        frontmatter += f"episode: {category.number}\n"

    frontmatter += f"""source: Turing Post Newsletter
original_subject: "{safe_subject}"
---

"""
    return frontmatter

def process_newsletter(eml_path: Path, output_dir: Path) -> Path:
    """
    Process a single newsletter .eml file and save as Markdown.

    Returns the path to the created Markdown file.
    """
    # Parse email
    email_data = parse_eml_file(eml_path)

    # Extract category
    category = extract_category(email_data['subject'])

    # Create category subdirectory
    category_dir = output_dir / category.category_type.lower()
    category_dir.mkdir(parents=True, exist_ok=True)

    # Generate filename
    date_str = email_data['date'].strftime('%Y-%m-%d') if email_data['date'] else 'unknown'
    title_slug = slugify(category.title)

    if category.number:
        filename = f"{date_str}-{category.category_type.lower()}-{category.number:03d}-{title_slug}.md"
    else:
        filename = f"{date_str}-{title_slug}.md"

    output_path = category_dir / filename

    # Convert content to Markdown
    if email_data['html_content']:
        content = html_to_markdown(email_data['html_content'])
    elif email_data['text_content']:
        content = email_data['text_content']
    else:
        content = "(No content extracted)"

    # Build final document
    frontmatter = create_frontmatter(email_data, category)
    full_document = frontmatter + f"# {category.title}\n\n" + content

    # Write file
    output_path.write_text(full_document, encoding='utf-8')

    return output_path

def process_all_newsletters(emails_dir: Path, output_dir: Path) -> dict:
    """
    Process all .eml files in directory.

    Returns summary statistics.
    """
    eml_files = list(emails_dir.glob('*.eml'))

    stats = {
        'total': len(eml_files),
        'processed': 0,
        'errors': [],
        'categories': {}
    }

    for eml_path in eml_files:
        try:
            output_path = process_newsletter(eml_path, output_dir)
            stats['processed'] += 1

            # Track category counts
            category = output_path.parent.name
            stats['categories'][category] = stats['categories'].get(category, 0) + 1

            print(f"Processed: {eml_path.name} -> {output_path}")

        except Exception as e:
            stats['errors'].append((eml_path.name, str(e)))
            print(f"Error processing {eml_path.name}: {e}")

    return stats

# Main execution
if __name__ == '__main__':
    import sys

    emails_dir = Path('emails')
    output_dir = Path('content')

    if not emails_dir.exists():
        print(f"Error: {emails_dir} directory not found")
        sys.exit(1)

    output_dir.mkdir(exist_ok=True)

    print(f"Processing newsletters from {emails_dir} to {output_dir}")
    print("-" * 60)

    stats = process_all_newsletters(emails_dir, output_dir)

    print("-" * 60)
    print(f"Processed {stats['processed']}/{stats['total']} files")
    print("\nCategories:")
    for cat, count in sorted(stats['categories'].items()):
        print(f"  {cat}: {count}")

    if stats['errors']:
        print(f"\nErrors ({len(stats['errors'])}):")
        for filename, error in stats['errors']:
            print(f"  {filename}: {error}")

Advanced Usage #

Handling Complex HTML Newsletters #

For newsletters with complex HTML (tables, nested divs, email-specific markup), consider pre-cleaning with BeautifulSoup: [7]

from bs4 import BeautifulSoup
import re

def clean_newsletter_html(html: str) -> str:
    """
    Pre-clean newsletter HTML before Markdown conversion.
    Removes email cruft while preserving content structure.
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Remove elements that don't convert well
    for tag in soup.find_all(['style', 'script', 'head', 'meta', 'link']):
        tag.decompose()

    # Remove hidden elements
    for el in soup.find_all(style=re.compile(r'display:\s*none', re.I)):
        el.decompose()

    # Remove tracking pixels and spacer images
    for img in soup.find_all('img'):
        src = img.get('src', '')
        width = img.get('width', '')
        height = img.get('height', '')
        alt = img.get('alt', '')

        # Remove 1x1 tracking pixels
        if width == '1' or height == '1':
            img.decompose()
            continue

        # Remove spacer GIFs
        if 'spacer' in src.lower() or 'blank' in src.lower():
            img.decompose()
            continue

        # Keep images with meaningful alt text or content
        if not alt and not src:
            img.decompose()

    # Remove empty paragraphs
    for p in soup.find_all('p'):
        if not p.get_text(strip=True) and not p.find('img'):
            p.decompose()

    # Remove newsletter footer boilerplate (common patterns)
    footer_patterns = [
        'unsubscribe',
        'manage your preferences',
        'update subscription',
        'you are receiving this',
        'this email was sent to',
        'view in browser',
    ]

    for pattern in footer_patterns:
        for el in soup.find_all(string=re.compile(pattern, re.I)):
            # Find parent container and remove it
            parent = el.find_parent(['div', 'td', 'tr', 'table', 'p'])
            if parent:
                parent.decompose()

    # Simplify nested tables (common in email templates)
    # Replace single-cell tables with their content
    for table in soup.find_all('table'):
        cells = table.find_all(['td', 'th'])
        if len(cells) == 1:
            table.replace_with(cells[0])

    return str(soup)

def enhanced_html_to_markdown(html: str) -> str:
    """Convert HTML to Markdown with pre-cleaning."""
    cleaned = clean_newsletter_html(html)
    return html_to_markdown(cleaned)

Extracting and Preserving Images #

import hashlib
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def extract_images(html: str, output_dir: Path) -> dict:
    """
    Extract image URLs from HTML and optionally download them.
    Returns a mapping of original URLs to local paths.
    """
    soup = BeautifulSoup(html, 'html.parser')
    image_map = {}

    images_dir = output_dir / 'images'
    images_dir.mkdir(exist_ok=True)

    for img in soup.find_all('img'):
        src = img.get('src', '')
        if not src or src.startswith('data:'):
            continue

        # Generate local filename from URL hash
        url_hash = hashlib.md5(src.encode()).hexdigest()[:12]
        ext = Path(urlparse(src).path).suffix or '.png'
        local_name = f"{url_hash}{ext}"
        local_path = images_dir / local_name

        # Download if not already present
        if not local_path.exists():
            try:
                response = requests.get(src, timeout=10)
                response.raise_for_status()
                local_path.write_bytes(response.content)
            except Exception as e:
                print(f"Failed to download {src}: {e}")
                continue

        image_map[src] = f"images/{local_name}"

    return image_map

def replace_image_urls(markdown: str, image_map: dict) -> str:
    """Replace remote image URLs with local paths in Markdown."""
    for remote_url, local_path in image_map.items():
        markdown = markdown.replace(remote_url, local_path)
    return markdown

Alternatives Considered #

| Library/Stack | Simplicity | Popularity | Support | Why Not Chosen |
| --- | --- | --- | --- | --- |
| Node.js mailparser + turndown | Medium | mailparser: 1.3M/week, turndown: 2M/week [4][8] | Active | Adds Node.js runtime dependency; overkill for batch processing; better for streaming/real-time |
| Python markdownify | High | ~100K/week | Active | Less control over output format; html2text more battle-tested |
| Deno + postal-mime | Medium | postal-mime: ~65K/week [9] | Active | Ecosystem less mature; postal-mime designed for browser/serverless, not batch |
| Python eml_parser | Medium | Lower adoption | Limited | Focuses on forensics/metadata; html2text better for content extraction |

Node.js Alternative: When to Use It #

If you need streaming processing, real-time email handling, or are already in a Node.js ecosystem, the mailparser + turndown combination is excellent: [4][8]

// Node.js alternative implementation
const { simpleParser } = require('mailparser');
const TurndownService = require('turndown');
const fs = require('fs').promises;

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-'
});

async function parseEml(filePath) {
  const emlContent = await fs.readFile(filePath);
  const parsed = await simpleParser(emlContent);

  return {
    subject: parsed.subject,
    date: parsed.date,
    from: parsed.from?.text,
    html: parsed.html,
    text: parsed.text
  };
}

function htmlToMarkdown(html) {
  return turndown.turndown(html);
}

// Usage (wrapped in an async IIFE, since top-level await
// is not available in CommonJS modules)
(async () => {
  const email = await parseEml('emails/newsletter.eml');
  const markdown = htmlToMarkdown(email.html);
})();

Caveats and Limitations #

When This Is NOT the Right Choice #

  - You need streaming or real-time email processing (use Node.js mailparser)
  - You are parsing email in the browser (use PostalMime)
  - You must preserve exact HTML fidelity (use markdownify with custom rules)
  - You are processing extremely large attachments (100MB+) where memory is a concern

Known Limitations #

  1. html2text GPL License: html2text is distributed under GPLv3. Check compatibility with your licensing requirements. [2]

  2. Image Handling: Images are converted to Markdown syntax ![alt](url) but not downloaded by default. Implement separate image extraction if needed.

  3. Email Template Cruft: Newsletter HTML often contains significant boilerplate (tracking pixels, MSO conditionals, nested tables). Pre-cleaning with BeautifulSoup recommended.

  4. Emoji in Subject Lines: The category extraction regex needs careful handling of Unicode emoji characters. Test thoroughly with your actual data.

  5. Date Parsing Edge Cases: Some emails may have malformed date headers. Implement fallback to file modification time or extract from filename.
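One way to implement that date fallback is a small helper (a sketch; `safe_date` is a name introduced here for illustration, not part of the pipeline above):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path

def safe_date(msg, eml_path: Path) -> datetime:
    """Parse the Date header, falling back to the file's mtime when malformed."""
    raw = msg.get('date')
    if raw:
        try:
            # str() handles both plain strings and policy.default header objects
            return parsedate_to_datetime(str(raw))
        except (TypeError, ValueError):
            pass  # Malformed Date header
    # Fallback: file modification time (UTC)
    return datetime.fromtimestamp(eml_path.stat().st_mtime, tz=timezone.utc)
```

Extracting a date from the filename is another option when your archive encodes it there, but mtime requires no naming convention.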

Content Directory Structure #

The recommended output structure for 218 newsletters:

content/
├── fod/                    # Findings of the Day series
│   ├── 2024-08-26-fod-064-golden-age-for-indie-devs.md
│   └── ...
├── topic/                  # Topic deep-dives
│   ├── 2024-09-15-topic-004-what-is-fsdp-and-yafsdp.md
│   └── ...
├── ai101/                  # AI 101 educational series
│   └── ...
├── agentic/                # Superhero/Agentic series
│   └── ...
├── insights/               # San Francisco/Insights series
│   └── ...
├── podcast/                # Interview/podcast episodes
│   └── ...
├── concepts/               # Concepts series
│   └── ...
├── guestpost/              # Guest contributions
│   └── ...
├── webinar/                # Webinar announcements
│   └── ...
└── uncategorized/          # Emails not matching patterns
    └── ...

Bibliography #


  1. html2text on PyPI - Python package for HTML to Markdown conversion, ~556K weekly downloads

  2. GitHub - Alir3z4/html2text - Official repository, 2.1K stars, GPLv3 license

  3. Python email.parser documentation - Official Python standard library documentation

  4. mailparser on npm - Node.js email parser, ~1.3M weekly downloads, 1.6K GitHub stars

  5. Stack Overflow: Parsing EML files - Community discussion on EML parsing approaches

  6. Sample analysis of Turing Post newsletter subject line patterns from provided .eml files

  7. Converting HTML to Markdown with Python - Comprehensive Guide - Best practices for HTML pre-cleaning

  8. Turndown on npm - JavaScript HTML to Markdown converter, ~2M weekly downloads, 10.4K GitHub stars

  9. postal-mime on npm - Browser/serverless email parser, ~65K weekly downloads
