Python BeautifulSoup: Extract Heading and Body Text from HTML
You can use the BeautifulSoup library in Python to parse the HTML file and extract the heading and body text. Here's an example code snippet that does this:

```python
from bs4 import BeautifulSoup

def get_heading_and_body(html_file):
    # Parse the HTML file (assumes it is UTF-8 encoded)
    with open(html_file, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')

    # Find the first heading tag (<h1> to <h6>); find() returns None if there is none
    heading_tag = soup.find(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
    heading = heading_tag.text if heading_tag else None

    # Collect the text of every <p> tag in the document
    body_text = [p_tag.text for p_tag in soup.find_all('p')]

    return heading, body_text

# Call the function with the HTML file path
heading, body_text = get_heading_and_body('index.html')

# Print the results
print(f"Heading: {heading}")
print("Body Text:")
for text in body_text:
    print(text)
```

Make sure to replace `'index.html'` with the actual path to your HTML file. The code reads the file, parses it with BeautifulSoup, and takes the text of the first `<h1>` to `<h6>` tag it finds as the heading. It then collects the text of every `<p>` tag in the document, not only the paragraphs that follow that heading, and returns the heading text together with the list of paragraph texts.
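
If you only want the paragraphs that belong to the first heading, one option is to walk forward from the heading and stop at the next one. The following is a minimal sketch, not a definitive implementation: `get_section`, `HEADING_TAGS`, and the `sample` string are hypothetical names made up for illustration, and it assumes that a section ends where the next `<h1>` to `<h6>` tag begins.

```python
from bs4 import BeautifulSoup

# Hypothetical helper names; the section boundary rule is an assumption
HEADING_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

def get_section(html):
    """Return the first heading's text and the <p> texts before the next heading."""
    soup = BeautifulSoup(html, 'html.parser')
    heading_tag = soup.find(HEADING_TAGS)
    if heading_tag is None:
        return None, []

    body_text = []
    # Walk the headings and paragraphs that come after the heading, in document order
    for tag in heading_tag.find_all_next(HEADING_TAGS + ['p']):
        if tag.name in HEADING_TAGS:
            break  # the next heading starts a new section
        body_text.append(tag.get_text(strip=True))

    return heading_tag.get_text(strip=True), body_text

# Made-up sample document for illustration
sample = (
    "<p>Intro</p><h1>Title</h1><p>First</p>"
    "<div><p>Second</p></div><h2>Next</h2><p>Other</p>"
)
print(get_section(sample))  # ('Title', ['First', 'Second'])
```

Because `find_all_next()` returns matches in document order, paragraphs nested inside a `<div>` are still picked up, while paragraphs that appear before the heading or after the next heading are left out.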
Original source: https://www.cveoy.top/t/topic/qc8R. Copyright belongs to the author. Please do not reproduce or scrape.