web2vec.extractors.html_body_features module

class web2vec.extractors.html_body_features.HtmlBodyFeatures(contains_forms: bool, contains_obfuscated_scripts: bool, contains_suspicious_keywords: bool, body_length: int, num_titles: int, num_images: int, num_links: int, script_length: int, special_characters: int, script_to_special_chars_ratio: float, script_to_body_ratio: float, body_to_special_char_ratio: float, iframe_redirection: int, mouse_over_effect: int, right_click_disabled: int, num_scripts_http: int, num_styles_http: int, num_iframes_http: int, num_external_scripts: int, num_external_styles: int, num_external_iframes: int, num_meta_tags: int, num_forms: int, num_forms_post: int, num_forms_get: int, num_forms_external_action: int, num_hidden_elements: int, num_safe_anchors: int, num_media_http: int, num_media_external: int, num_email_forms: int, num_internal_links: int, favicon_url: Optional[str], logo_url: Optional[str], found_forms: List[Dict[str, Any]] = <factory>, found_images: List[Dict[str, Any]] = <factory>, found_anchors: List[Dict[str, Any]] = <factory>, found_media: List[Dict[str, Any]] = <factory>, copyright: Optional[str] = None)

Bases: object

body_length: int
body_to_special_char_ratio: float
contains_forms: bool
contains_obfuscated_scripts: bool
contains_suspicious_keywords: bool
copyright: str | None = None
favicon_url: str | None
found_anchors: List[Dict[str, Any]]
found_forms: List[Dict[str, Any]]
found_images: List[Dict[str, Any]]
found_media: List[Dict[str, Any]]
iframe_redirection: int
logo_url: str | None
mouse_over_effect: int
num_email_forms: int
num_external_iframes: int
num_external_scripts: int
num_external_styles: int
num_forms: int
num_forms_external_action: int
num_forms_get: int
num_forms_post: int
num_hidden_elements: int
num_iframes_http: int
num_images: int
num_media_external: int
num_media_http: int
num_meta_tags: int
num_safe_anchors: int
num_scripts_http: int
num_styles_http: int
num_titles: int
right_click_disabled: int
script_length: int
script_to_body_ratio: float
script_to_special_chars_ratio: float
special_characters: int
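The `<factory>` defaults in the class signature above are how a dataclass field declared with `default_factory` appears in a generated signature, which suggests that `HtmlBodyFeatures` is a dataclass. The sketch below illustrates that pattern on a reduced, hypothetical stand-in (`MiniFeatures` is not part of web2vec): every instance gets its own list instead of sharing one mutable default.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MiniFeatures:
    """Hypothetical, reduced stand-in for HtmlBodyFeatures."""

    num_forms: int
    favicon_url: Optional[str]
    # default_factory builds a fresh list per instance; this is what
    # shows up as "= <factory>" in the generated signature.
    found_forms: List[Dict[str, Any]] = field(default_factory=list)
    copyright: Optional[str] = None


a = MiniFeatures(num_forms=1, favicon_url=None)
b = MiniFeatures(num_forms=0, favicon_url="/favicon.ico")
a.found_forms.append({"action": "/login", "method": "post"})

# The lists are independent: mutating a.found_forms leaves b untouched.
print(a.found_forms, b.found_forms)
```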
web2vec.extractors.html_body_features.body_length(soup: BeautifulSoup) → int

Get the length of the body text in the given HTML content.

web2vec.extractors.html_body_features.body_to_special_char_ratio(soup: BeautifulSoup) → float

Get the ratio of body length to special characters in the given HTML content.
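As a rough illustration of what a body-to-special-characters ratio could measure, the sketch below uses only the standard library rather than BeautifulSoup. It is not web2vec's implementation: the definition of "special character" (neither alphanumeric nor whitespace), the exclusion of script and style text, and the `0.0` result for a zero denominator are all assumptions, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def count_special(text):
    # Assumption: "special" means neither alphanumeric nor whitespace.
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())


def safe_ratio(numerator, denominator):
    # Assumption: return 0.0 instead of raising when the denominator is zero.
    return numerator / denominator if denominator else 0.0


text = visible_text("<body><p>Hello, world!</p><script>var x = 1;</script></body>")
ratio = safe_ratio(len(text), count_special(text))
```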

web2vec.extractors.html_body_features.check_obfuscated_scripts(soup: BeautifulSoup) → bool

Check if the response contains any obfuscated scripts.
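The reference does not say which signals count as obfuscation. One common heuristic is to look for dynamic-evaluation calls and long runs of escaped characters in script source. The patterns and the function name below are hypothetical and may differ from what web2vec checks.

```python
import re

# Hypothetical patterns; the library's real list may differ.
OBFUSCATION_PATTERNS = [
    re.compile(r"\beval\s*\("),
    re.compile(r"\bunescape\s*\("),
    re.compile(r"String\.fromCharCode"),
    re.compile(r"(?:\\x[0-9a-fA-F]{2}){4,}"),  # four or more \xNN escapes in a row
]


def looks_obfuscated(script_source):
    """Return True if any heuristic pattern matches the script text."""
    return any(p.search(script_source) for p in OBFUSCATION_PATTERNS)
```

A heuristic like this produces false positives on minified but benign code, so it is better treated as one feature among many than as a verdict.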

web2vec.extractors.html_body_features.check_suspicious_keywords(soup: BeautifulSoup, keywords: List[str] | None = None) → bool

Check if the response contains any suspicious keywords.
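The `keywords: List[str] | None = None` parameter suggests that a built-in list is used when the caller passes nothing. The sketch below shows that pattern on plain text. The default list, the case-insensitive substring matching, and the function name are assumptions, not the library's documented behaviour.

```python
# Hypothetical defaults; the library's built-in list is not documented here.
DEFAULT_KEYWORDS = ["login", "password", "verify", "bank"]


def contains_keywords(text, keywords=None):
    """Case-insensitive substring check against a keyword list."""
    keywords = DEFAULT_KEYWORDS if keywords is None else keywords
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)
```

Passing an explicit empty list disables the check in this sketch, because only `None` selects the defaults.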

Find the copyright information in the given HTML content.

web2vec.extractors.html_body_features.find_favicon(soup: BeautifulSoup) → str | None

Find the favicon URL in the given HTML content.

Find the logo URL in the given HTML content.

web2vec.extractors.html_body_features.get_html_body_features(body: str, url: str) → HtmlBodyFeatures

Extract HTML body features from the given HTML body and page URL.
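`get_html_body_features` takes a `url`, while several helpers in this module take a `base_domain`. A plausible reading is that the domain is derived from the URL and then used to separate internal from external resources. The sketch below shows one way to do that with `urllib.parse`. Both helper names are hypothetical, and the rules (strip a leading `www.`, treat subdomains and relative URLs as internal) are assumptions.

```python
from urllib.parse import urlparse


def base_domain_of(url):
    """Hypothetical helper: host of the page URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_external(resource_url, base_domain):
    """Hypothetical helper: True if the resource is served from another host."""
    host = urlparse(resource_url).hostname
    if host is None:  # relative URL, so same site
        return False
    return not (host == base_domain or host.endswith("." + base_domain))


domain = base_domain_of("https://www.example.com/login")
```

Stripping `www.` is only an approximation of a registrable domain. A production version would normally consult the Public Suffix List.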

web2vec.extractors.html_body_features.hidden_elements(soup: BeautifulSoup) → int

Get the number of hidden elements in the given HTML content.

web2vec.extractors.html_body_features.iframe_redirection(soup: BeautifulSoup) → int

Check if the response contains any iframe redirection.

web2vec.extractors.html_body_features.mouse_over_effect(soup: BeautifulSoup) → int

Check if the response contains any mouse-over effect.

web2vec.extractors.html_body_features.num_email_forms(soup: BeautifulSoup) → int

Get the number of email forms in the given HTML content.

web2vec.extractors.html_body_features.num_external_iframes(soup: BeautifulSoup, base_domain: str) → int

Get the number of external iframes in the given HTML content.

web2vec.extractors.html_body_features.num_external_scripts(soup: BeautifulSoup, base_domain: str) → int

Get the number of external scripts in the given HTML content.

web2vec.extractors.html_body_features.num_external_styles(soup: BeautifulSoup, base_domain: str) → int

Get the number of external stylesheets in the given HTML content.

web2vec.extractors.html_body_features.num_forms(soup: BeautifulSoup) → int

Get the number of forms in the given HTML content.

web2vec.extractors.html_body_features.num_forms_external_action(soup: BeautifulSoup, base_domain: str) → int

Get the number of forms with external action in the given HTML content.

web2vec.extractors.html_body_features.num_forms_get(soup: BeautifulSoup) → int

Get the number of GET forms in the given HTML content.

web2vec.extractors.html_body_features.num_forms_post(soup: BeautifulSoup) → int

Get the number of POST forms in the given HTML content.
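The form counters above can be pictured as a single pass that tallies `<form>` tags by their `method` attribute. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup. It counts a missing or empty `method` as GET, which matches HTML's default but is an assumption about how web2vec counts; the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class _FormMethods(HTMLParser):
    """Tallies <form> tags by their (lowercased) method attribute."""

    def __init__(self):
        super().__init__()
        self.methods = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # HTML treats a missing or empty method attribute as GET.
            method = (dict(attrs).get("method") or "get").strip().lower()
            self.methods[method] += 1


def count_form_methods(html):
    parser = _FormMethods()
    parser.feed(html)
    parser.close()
    return parser.methods


counts = count_form_methods(
    '<form method="POST" action="/a"></form>'
    '<form action="/b"></form>'
    '<form method="get"></form>'
)
```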

web2vec.extractors.html_body_features.num_iframes_http(soup: BeautifulSoup) → int

Get the number of HTTP iframes in the given HTML content.

web2vec.extractors.html_body_features.num_images(soup: BeautifulSoup) → int

Get the number of images in the given HTML content.

Get the number of internal links in the given HTML content.

Get the number of links in the given HTML content.

web2vec.extractors.html_body_features.num_media_external(soup: BeautifulSoup, base_domain: str) → int

Get the number of external media in the given HTML content.

web2vec.extractors.html_body_features.num_media_http(soup: BeautifulSoup) → int

Get the number of HTTP media in the given HTML content.

web2vec.extractors.html_body_features.num_meta_tags(soup: BeautifulSoup) → int

Get the number of meta tags in the given HTML content.

web2vec.extractors.html_body_features.num_safe_anchors(soup: BeautifulSoup, base_domain: str) → int

Get the number of safe anchors in the given HTML content.

web2vec.extractors.html_body_features.num_scripts_http(soup: BeautifulSoup) → int

Get the number of HTTP scripts in the given HTML content.

web2vec.extractors.html_body_features.num_styles_http(soup: BeautifulSoup) → int

Get the number of HTTP stylesheets in the given HTML content.

web2vec.extractors.html_body_features.num_titles(soup: BeautifulSoup) → int

Get the number of titles in the given HTML content.

web2vec.extractors.html_body_features.right_click_disabled(soup: BeautifulSoup) → int

Check if the response contains any right-click disabled content.
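Pages usually disable the context menu through an `oncontextmenu` attribute, a `contextmenu` event listener, or a check for mouse button 2. The sketch below flags those three patterns in raw HTML and returns `1` or `0`, mirroring the `int` return type in the signature. The patterns and the function name are hypothetical, not the library's actual checks.

```python
import re

# Hypothetical patterns for common ways of suppressing the context menu.
RIGHT_CLICK_PATTERNS = [
    re.compile(r"oncontextmenu\s*=", re.IGNORECASE),
    re.compile(r"addEventListener\(\s*['\"]contextmenu['\"]", re.IGNORECASE),
    re.compile(r"event\.button\s*===?\s*2"),
]


def right_click_blocked(html):
    """Return 1 if any pattern matches the raw HTML, else 0."""
    return int(any(p.search(html) for p in RIGHT_CLICK_PATTERNS))
```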

web2vec.extractors.html_body_features.script_length(soup: BeautifulSoup) → int

Get the length of the scripts in the given HTML content.

web2vec.extractors.html_body_features.script_to_body_ratio(soup: BeautifulSoup) → float

Get the ratio of script length to body length in the given HTML content.

web2vec.extractors.html_body_features.script_to_special_chars_ratio(soup: BeautifulSoup) → float

Get the ratio of script length to special characters in the given HTML content.

web2vec.extractors.html_body_features.special_characters(soup: BeautifulSoup) → int

Get the number of special characters in the given HTML content.