web2vec.extractors.html_body_features module
- class web2vec.extractors.html_body_features.HtmlBodyFeatures(contains_forms: bool, contains_obfuscated_scripts: bool, contains_suspicious_keywords: bool, body_length: int, num_titles: int, num_images: int, num_links: int, script_length: int, special_characters: int, script_to_special_chars_ratio: float, script_to_body_ratio: float, body_to_special_char_ratio: float, iframe_redirection: int, mouse_over_effect: int, right_click_disabled: int, num_scripts_http: int, num_styles_http: int, num_iframes_http: int, num_external_scripts: int, num_external_styles: int, num_external_iframes: int, num_meta_tags: int, num_forms: int, num_forms_post: int, num_forms_get: int, num_forms_external_action: int, num_hidden_elements: int, num_safe_anchors: int, num_media_http: int, num_media_external: int, num_email_forms: int, num_internal_links: int, favicon_url: Optional[str], logo_url: Optional[str], found_forms: List[Dict[str, Any]] = <factory>, found_images: List[Dict[str, Any]] = <factory>, found_anchors: List[Dict[str, Any]] = <factory>, found_media: List[Dict[str, Any]] = <factory>, copyright: Optional[str] = None, source_mode: str = 'raw_http', was_js_rendered: bool = False, likely_js_spa: bool = False, html_snapshot_path: Optional[str] = None, num_network_requests: int = 0, num_external_network_requests: int = 0, num_api_endpoints: int = 0, found_network_requests: List[str] = <factory>, found_api_endpoints: List[str] = <factory>)
Bases: object
- body_length: int
- body_to_special_char_ratio: float
- contains_forms: bool
- contains_obfuscated_scripts: bool
- contains_suspicious_keywords: bool
- copyright: str | None = None
- favicon_url: str | None
- found_anchors: List[Dict[str, Any]]
- found_api_endpoints: List[str]
- found_forms: List[Dict[str, Any]]
- found_images: List[Dict[str, Any]]
- found_media: List[Dict[str, Any]]
- found_network_requests: List[str]
- html_snapshot_path: str | None = None
- iframe_redirection: int
- likely_js_spa: bool = False
- logo_url: str | None
- mouse_over_effect: int
- num_api_endpoints: int = 0
- num_email_forms: int
- num_external_iframes: int
- num_external_network_requests: int = 0
- num_external_scripts: int
- num_external_styles: int
- num_forms: int
- num_forms_external_action: int
- num_forms_get: int
- num_forms_post: int
- num_iframes_http: int
- num_images: int
- num_internal_links: int
- num_links: int
- num_media_external: int
- num_media_http: int
- num_meta_tags: int
- num_network_requests: int = 0
- num_safe_anchors: int
- num_scripts_http: int
- num_styles_http: int
- num_titles: int
- right_click_disabled: int
- script_length: int
- script_to_body_ratio: float
- script_to_special_chars_ratio: float
- source_mode: str = 'raw_http'
- special_characters: int
- was_js_rendered: bool = False
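The `<factory>` defaults in the class signature mean the list fields are built fresh for each instance, which is how `dataclasses.field(default_factory=list)` behaves. The following is a minimal sketch of that pattern using a hypothetical, trimmed-down stand-in class (`MiniBodyFeatures` is an illustration, not part of web2vec) with a handful of the fields listed above:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MiniBodyFeatures:
    """Hypothetical stand-in carrying a few of the documented fields."""

    # Fields without defaults must come before fields with defaults.
    contains_forms: bool
    body_length: int
    favicon_url: Optional[str]
    # A factory gives every instance its own list instead of a shared one.
    found_forms: List[Dict[str, Any]] = field(default_factory=list)
    source_mode: str = "raw_http"
    was_js_rendered: bool = False


a = MiniBodyFeatures(contains_forms=True, body_length=120, favicon_url=None)
b = MiniBodyFeatures(contains_forms=False, body_length=0, favicon_url=None)
a.found_forms.append({"method": "post"})

# asdict() turns the record into a plain dict, e.g. for a feature table row.
row = asdict(a)
```

Appending to `a.found_forms` leaves `b.found_forms` empty, which is the property the factory default is there to guarantee.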
- web2vec.extractors.html_body_features.body_length(soup: BeautifulSoup) -> int
Get the length of the body text in the given HTML content.
- web2vec.extractors.html_body_features.body_to_special_char_ratio(soup: BeautifulSoup) -> float
Get the ratio of body length to special characters in the given HTML content.
- web2vec.extractors.html_body_features.check_obfuscated_scripts(soup: BeautifulSoup) -> bool
Check if the response contains any obfuscated scripts.
- web2vec.extractors.html_body_features.check_suspicious_keywords(soup: BeautifulSoup, keywords: List[str] | None = None) -> bool
Check if the response contains any suspicious keywords.
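As a rough illustration of a keyword check with an optional keyword list, the sketch below works on plain text rather than a `BeautifulSoup` object. The function name `contains_any_keyword`, the case-insensitive substring matching, and the default keyword list are all assumptions made for this example; the keywords and matching rules web2vec actually uses are not shown on this page.

```python
from typing import List, Optional

# Hypothetical defaults for illustration only.
DEFAULT_KEYWORDS = ["login", "password", "verify", "update"]


def contains_any_keyword(text: str, keywords: Optional[List[str]] = None) -> bool:
    """Return True if any keyword occurs in the text, ignoring case."""
    if keywords is None:
        keywords = DEFAULT_KEYWORDS
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

Passing `None` falls back to the default list, mirroring the `keywords: List[str] | None = None` parameter in the signature above.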
- web2vec.extractors.html_body_features.detect_api_endpoints(urls: List[str]) -> List[str]
Return URLs that look like API/JSON endpoints.
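One plausible way to filter a URL list down to API-like entries is to inspect the path component. This sketch uses only the standard library; the helper names and the path markers are hypothetical, since the rules web2vec applies are not documented here.

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical markers chosen for illustration.
API_PATH_MARKERS = ("/api/", "/graphql", "/rest/")


def looks_like_api_endpoint(url: str) -> bool:
    """Heuristic: the path ends in .json or contains an API-style marker."""
    path = urlparse(url).path.lower()
    return path.endswith(".json") or any(m in path for m in API_PATH_MARKERS)


def filter_api_endpoints(urls: List[str]) -> List[str]:
    """Keep the URLs that match the heuristic, preserving input order."""
    return [url for url in urls if looks_like_api_endpoint(url)]
```

Checking `urlparse(url).path` rather than the raw string keeps query strings such as `?page=2` from hiding a `.json` suffix.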
- web2vec.extractors.html_body_features.detect_likely_js_spa(soup: BeautifulSoup) -> bool
Heuristic signal that a page likely depends on JS rendering.
- web2vec.extractors.html_body_features.find_copyright(soup: BeautifulSoup) -> str | None
Find the copyright information in the given HTML content.
- web2vec.extractors.html_body_features.find_favicon(soup: BeautifulSoup) -> str | None
Find the favicon URL in the given HTML content.
- web2vec.extractors.html_body_features.find_logo(soup: BeautifulSoup) -> str | None
Find the logo URL in the given HTML content.
- web2vec.extractors.html_body_features.get_html_body_features(body: str, url: str, source_mode: str = 'raw_http', was_js_rendered: bool = False, html_snapshot_path: str | None = None, network_request_urls: List[str] | None = None) -> HtmlBodyFeatures
Extract HTML body features from the
Get the number of hidden elements in the given HTML content.
- web2vec.extractors.html_body_features.iframe_redirection(soup: BeautifulSoup) -> int
Check if the response contains any iframe redirection.
- web2vec.extractors.html_body_features.is_external_url(url: str, base_domain: str) -> bool
Return True when URL points outside current page domain.
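A domain comparison of this kind can be sketched with `urllib.parse`. The function name `is_external` is hypothetical, and two choices here are assumptions rather than documented web2vec behaviour: subdomains of `base_domain` count as internal, and URLs without a host (relative paths, `mailto:` links) count as internal.

```python
from urllib.parse import urlparse


def is_external(url: str, base_domain: str) -> bool:
    """Return True when the URL's host is outside base_domain."""
    host = urlparse(url).hostname
    if host is None:
        # Relative URLs, fragments, and mailto:-style links carry no host.
        return False
    host = host.lower()
    base = base_domain.lower()
    # Require a dot before the base so "notexample.com" is not a match.
    return host != base and not host.endswith("." + base)
```

The `"." + base` suffix test is the detail that matters: a plain `endswith(base)` would wrongly treat `notexample.com` as part of `example.com`.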
- web2vec.extractors.html_body_features.mouse_over_effect(soup: BeautifulSoup) -> int
Check if the response contains any mouse-over effect.
- web2vec.extractors.html_body_features.num_email_forms(soup: BeautifulSoup) -> int
Get the number of email forms in the given HTML content.
- web2vec.extractors.html_body_features.num_external_iframes(soup: BeautifulSoup, base_domain: str) -> int
Get the number of external iframes in the given HTML content.
- web2vec.extractors.html_body_features.num_external_scripts(soup: BeautifulSoup, base_domain: str) -> int
Get the number of external scripts in the given HTML content.
- web2vec.extractors.html_body_features.num_external_styles(soup: BeautifulSoup, base_domain: str) -> int
Get the number of external stylesheets in the given HTML content.
- web2vec.extractors.html_body_features.num_forms(soup: BeautifulSoup) -> int
Get the number of forms in the given HTML content.
- web2vec.extractors.html_body_features.num_forms_external_action(soup: BeautifulSoup, base_domain: str) -> int
Get the number of forms with external action in the given HTML content.
- web2vec.extractors.html_body_features.num_forms_get(soup: BeautifulSoup) -> int
Get the number of GET forms in the given HTML content.
- web2vec.extractors.html_body_features.num_forms_post(soup: BeautifulSoup) -> int
Get the number of POST forms in the given HTML content.
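The form counters above can be approximated in one pass over the markup. This sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; `FormMethodCounter` and `count_form_methods` are hypothetical names. Counting a form with no `method` attribute as GET follows the HTML default, but whether web2vec does the same is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class FormMethodCounter(HTMLParser):
    """Tally <form> elements by their (lowercased) method attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.methods: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty method attribute defaults to GET in HTML.
            method = (dict(attrs).get("method") or "get").strip().lower()
            self.methods[method] += 1


def count_form_methods(html: str) -> Counter:
    parser = FormMethodCounter()
    parser.feed(html)
    parser.close()
    return parser.methods
```

With this tally, the total number of forms is `sum(counts.values())`, and the POST and GET counts are `counts["post"]` and `counts["get"]`.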
- web2vec.extractors.html_body_features.num_iframes_http(soup: BeautifulSoup) -> int
Get the number of HTTP iframes in the given HTML content.
- web2vec.extractors.html_body_features.num_images(soup: BeautifulSoup) -> int
Get the number of images in the given HTML content.
- web2vec.extractors.html_body_features.num_internal_links(soup: BeautifulSoup, base_domain: str) -> int
Get the number of internal links in the given HTML content.
- web2vec.extractors.html_body_features.num_links(soup: BeautifulSoup) -> int
Get the number of links in the given HTML content.
- web2vec.extractors.html_body_features.num_media_external(soup: BeautifulSoup, base_domain: str) -> int
Get the number of external media in the given HTML content.
- web2vec.extractors.html_body_features.num_media_http(soup: BeautifulSoup) -> int
Get the number of HTTP media in the given HTML content.
- web2vec.extractors.html_body_features.num_meta_tags(soup: BeautifulSoup) -> int
Get the number of meta tags in the given HTML content.
- web2vec.extractors.html_body_features.num_safe_anchors(soup: BeautifulSoup, base_domain: str) -> int
Get the number of safe anchors in the given HTML content.
- web2vec.extractors.html_body_features.num_scripts_http(soup: BeautifulSoup) -> int
Get the number of HTTP scripts in the given HTML content.
- web2vec.extractors.html_body_features.num_styles_http(soup: BeautifulSoup) -> int
Get the number of HTTP stylesheets in the given HTML content.
- web2vec.extractors.html_body_features.num_titles(soup: BeautifulSoup) -> int
Get the number of titles in the given HTML content.
- web2vec.extractors.html_body_features.right_click_disabled(soup: BeautifulSoup) -> int
Check if the response contains any right-click disabled content.
- web2vec.extractors.html_body_features.script_length(soup: BeautifulSoup) -> int
Get the length of the scripts in the given HTML content.
- web2vec.extractors.html_body_features.script_to_body_ratio(soup: BeautifulSoup) -> float
Get the ratio of script length to body length in the given HTML content.
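Ratio features such as this one need a rule for an empty denominator, for example a page with no body text. How web2vec handles that case is not stated on this page; the sketch below assumes a fallback of `0.0`, and `safe_ratio` is a hypothetical helper name.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Example: 50 characters of script inside a 200-character body.
script_share = safe_ratio(50, 200)
```

Returning a fixed value keeps the feature vector numeric for every page, at the cost of making an empty body indistinguishable from a body with no scripts.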