Python BeautifulSoup: Extract Paragraph Text by Heading in Markdown HTML

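To find paragraphs by heading in HTML generated from Markdown and return the full text of each paragraph, you can use the 'BeautifulSoup' library in Python. Markdown renderers typically emit headings and paragraphs as flat siblings, so each paragraph belongs to the nearest heading that precedes it. Here's an example that demonstrates this: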

import requests
from bs4 import BeautifulSoup

# Fetch the HTML content
url = 'https://example.com'  # replace with your URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404
html = response.text

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

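# Walk headings and paragraphs together, in document order.
# (Checking p.previous_sibling does not work reliably: it is usually the
# whitespace text node between tags, not the heading itself.)
heading_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

# Extract the text of each paragraph under the corresponding heading
result = {}
current_heading = None

for tag in soup.find_all(heading_tags + ['p']):
    if tag.name in heading_tags:
        # A heading starts a new section
        current_heading = tag.get_text(strip=True)
        result.setdefault(current_heading, '')
    elif current_heading is not None:
        # Paragraphs before the first heading are skipped
        text = tag.get_text(strip=True)
        # Separate consecutive paragraphs with a blank line
        result[current_heading] = (result[current_heading] + '\n\n' + text).strip()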

# Print the paragraphs under each heading
for heading, paragraph in result.items():
    print('Heading:', heading)
    print('Paragraph:', paragraph)
    print()

Make sure to set the 'url' variable to the address of the HTML page you want to process. The code finds all headings ('h1' to 'h6') and paragraphs ('p') in the HTML, associates each paragraph with the nearest heading that precedes it, and prints the paragraphs under each heading. Paragraphs that appear before the first heading are not assigned to any heading.
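If you only need the paragraphs under one particular heading, a lookup by heading text may be simpler than building the whole dictionary. The following is a minimal sketch, not a definitive implementation: the helper name 'paragraphs_under' and the 'sample_html' string are made up for illustration, and it assumes the heading and its paragraphs are direct siblings, as is typical for HTML rendered from Markdown.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']


def paragraphs_under(html, heading_text):
    """Return the text of the <p> siblings that follow the matching heading.

    Hypothetical helper: matches the first heading whose stripped text
    equals heading_text and stops at the next heading of any level.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Locate the first heading whose text matches exactly
    heading = next(
        (h for h in soup.find_all(HEADING_TAGS)
         if h.get_text(strip=True) == heading_text),
        None,
    )
    if heading is None:
        return []

    texts = []
    # find_next_siblings() yields only tags, so whitespace nodes are skipped
    for sibling in heading.find_next_siblings():
        if sibling.name in HEADING_TAGS:
            break  # the next heading ends this section
        if sibling.name == 'p':
            texts.append(sibling.get_text(strip=True))
    return texts


# Illustrative input, shaped like Markdown-rendered HTML
sample_html = """
<h1>Guide</h1>
<p>Intro text.</p>
<h2>Install</h2>
<p>Run the installer.</p>
<p>Then restart.</p>
<h2>Usage</h2>
<p>Call the tool.</p>
"""

print(paragraphs_under(sample_html, 'Install'))
# prints ['Run the installer.', 'Then restart.']
```

Stopping at the next heading of any level keeps the result limited to the paragraphs directly under the requested heading. If you want a section to include its sub-sections, compare heading levels instead of breaking on every heading.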
