Python BeautifulSoup: Extract Paragraph Text by Heading in Markdown HTML
To find paragraphs by heading in HTML generated from Markdown and return the full text of each paragraph, you can use the 'BeautifulSoup' library in Python. Here's example code that demonstrates one way to do this:
import requests
from bs4 import BeautifulSoup

# Fetch the HTML content
url = 'https://example.com'  # replace with your URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
html = response.text

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find all headings and paragraphs in document order
heading_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
elements = soup.find_all(heading_tags + ['p'])

# Collect the text of each paragraph under the most recent heading
result = {}
current_heading = None
for element in elements:
    if element.name in heading_tags:
        # A heading starts a new section
        current_heading = element.get_text(strip=True)
        result.setdefault(current_heading, [])
    elif current_heading is not None:
        # Paragraphs that appear before the first heading are skipped
        result[current_heading].append(element.get_text(strip=True))

# Print the paragraphs under each heading
for heading, paragraphs in result.items():
    print('Heading:', heading)
    print('Paragraph:', '\n'.join(paragraphs))
    print()
Make sure to replace the value of the 'url' variable with the URL of the HTML page you want to process. The code finds all headings ('h1' to 'h6') and paragraphs ('p') in the HTML, associates each paragraph with the heading it follows, and prints the paragraphs under each heading. Paragraphs that appear before the first heading are not included.
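If you only need the paragraphs under one particular heading, a lookup by heading text may be more convenient than building the whole dictionary. The following is a minimal sketch, not part of the original example: the function name `paragraphs_under`, the constant `HEADING_TAGS`, and the `sample_html` string are hypothetical names chosen for illustration, and it assumes the headings and paragraphs are siblings in the HTML, which is typical of Markdown output but is not guaranteed for every page.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')


def paragraphs_under(html, heading_text):
    """Return the text of the <p> siblings that follow the first heading
    whose text equals heading_text, stopping at the next heading."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find(
        lambda tag: tag.name in HEADING_TAGS
        and tag.get_text(strip=True) == heading_text
    )
    if heading is None:
        return []

    texts = []
    # find_next_siblings() returns Tag objects only, so the whitespace
    # strings between tags are skipped automatically.
    for sibling in heading.find_next_siblings():
        if sibling.name in HEADING_TAGS:
            break  # the next heading of any level ends this section
        if sibling.name == 'p':
            texts.append(sibling.get_text(strip=True))
    return texts


# Hypothetical sample input standing in for a fetched page
sample_html = """
<h1>Intro</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2>Details</h2>
<p>Third paragraph.</p>
"""

print(paragraphs_under(sample_html, 'Intro'))
```

This sketch ends a section at the next heading of any level. If subsections should count as part of their parent section, the stop condition would need to compare heading levels instead.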
Original source: https://www.cveoy.top/t/topic/qc14 Copyright belongs to the author. Please do not repost or scrape.