web2vec.utils module

web2vec.utils.create_directories(*directories: str)

Create directories if they do not exist.
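The behaviour described above can be sketched with the standard library. This is a minimal illustration, not the web2vec implementation; the name ensure_directories is hypothetical.

```python
import os


def ensure_directories(*directories: str) -> None:
    """Create each directory (and missing parents) unless it already exists."""
    for directory in directories:
        # exist_ok=True makes the call a no-op for directories that are already there.
        os.makedirs(directory, exist_ok=True)
```

Calling the function twice with the same arguments is safe, which is the property the docstring implies.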

web2vec.utils.entropy(string: str) → float

Calculate the entropy of the given string.
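The documentation does not say which entropy measure is used. A common choice for strings such as URLs is Shannon entropy over character frequencies, measured in bits; the sketch below assumes that definition, and the name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits per character) of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition a string of one repeated character has entropy 0, and a string of four distinct characters has entropy 2 bits.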

web2vec.utils.fetch_file_from_url(url, directory=None, headers=None, timeout=86400) → str

Check whether the file already exists in the directory and was saved less than timeout seconds ago. If not, download the file from the URL and save it in the directory. Return the path to the file.

Parameters:
  • url – URL of the file to download.

  • directory – Directory where the file should be saved.

  • headers – Optional HTTP headers to send with the request.

  • timeout – Maximum age of the saved file in seconds (default is 86400, i.e. one day).

Returns:

File path.
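The check-then-download logic described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the name cached_fetch is hypothetical, and hashing the URL to derive the file name is an assumed scheme (the library may name files differently).

```python
import hashlib
import os
import time
import urllib.request


def cached_fetch(url: str, directory: str, timeout: int = 86400) -> str:
    """Return a local path for url, downloading only when the saved copy is stale."""
    os.makedirs(directory, exist_ok=True)
    # Assumed naming scheme: a hash of the URL gives a filesystem-safe name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(directory, name)

    # The saved copy is fresh if it exists and is younger than timeout seconds.
    fresh = os.path.exists(path) and (time.time() - os.path.getmtime(path)) < timeout
    if not fresh:
        with urllib.request.urlopen(url) as response, open(path, "wb") as out:
            out.write(response.read())
    return path
```

With the default timeout, repeated calls within a day reuse the saved file instead of hitting the network again.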

web2vec.utils.fetch_file_from_url_and_read(url, directory=None, headers=None, timeout=86400) → str

Return the content of the file for the given URL.

web2vec.utils.fetch_url(url, headers=None, ssl_verify=None)

Fetch the given URL and return the response.

web2vec.utils.get_domain_from_url(url: str) → str

Extract the domain from the URL.
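One way to extract a domain is with urllib.parse from the standard library. The sketch below assumes "domain" means the host name without port or credentials; whether web2vec returns that, the full network location, or the registered domain is not stated here. The name domain_of is hypothetical.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host name of url, or an empty string if there is none."""
    # urlparse only recognises the host when the URL includes a scheme or starts with "//".
    return urlparse(url).hostname or ""
```

For example, domain_of("https://www.example.com:8443/path?q=1") gives "www.example.com".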

web2vec.utils.get_file_path_for_url(url, directory=None, timeout=86400) → str

Return the path to the file for the given URL.

web2vec.utils.get_github_repo_release_info(repo: str) → dict

Return the latest release information for the given GitHub repository.

web2vec.utils.get_ip_from_domain(domain: str) → str

Return the IP address for the given domain.

web2vec.utils.get_ip_from_url(url: str) → str

Return the IP address for the given URL.
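Resolving a domain, and resolving the host of a URL, can both be sketched with the standard library. The names resolve_domain and resolve_url are hypothetical, and the sketch assumes IPv4 resolution through socket.gethostbyname; the library may resolve differently.

```python
import socket
from urllib.parse import urlparse


def resolve_domain(domain: str) -> str:
    """Return one IPv4 address for domain (raises socket.gaierror on failure)."""
    return socket.gethostbyname(domain)


def resolve_url(url: str) -> str:
    """Extract the host from url and resolve it to an IPv4 address."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host found in URL: {url!r}")
    return resolve_domain(host)
```

An IPv4 literal is returned unchanged, so resolve_url("http://127.0.0.1:8000/x") gives "127.0.0.1" without a DNS lookup.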

web2vec.utils.is_numerical_type(obj: object) → bool

Check if the given object is a simple type.

web2vec.utils.sanitize_filename(filename)

Sanitize the filename by replacing invalid characters.
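Which characters count as invalid, and what replaces them, is not specified above. The sketch below assumes the characters that Windows and POSIX file systems commonly reject and replaces each with an underscore; the name safe_filename is hypothetical.

```python
import re

# Assumed set of invalid characters: path separators, reserved punctuation,
# and ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(filename: str, replacement: str = "_") -> str:
    """Replace every character in the assumed invalid set with replacement."""
    return _INVALID_CHARS.sub(replacement, filename)
```

For example, safe_filename("a/b:c?.txt") gives "a_b_c_.txt".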

web2vec.utils.store_json(data: dict, file_path: str)

Store the given data as a JSON file.
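Storing a dictionary as JSON can be sketched with the standard json module. The name write_json is hypothetical, and the encoding and indentation settings are assumptions rather than the library's documented behaviour.

```python
import json


def write_json(data: dict, file_path: str) -> None:
    """Serialise data to file_path as UTF-8 encoded, indented JSON."""
    with open(file_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        json.dump(data, handle, ensure_ascii=False, indent=2)
```

The data must contain only JSON-serialisable values; otherwise json.dump raises TypeError.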

web2vec.utils.transform_value(obj: object) → object

Transform the given object to a simple type.

web2vec.utils.valid_ip(host: str) → bool

Check if the given host is a valid IP address.
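A validity check of this kind can be sketched with the standard ipaddress module. The name looks_like_ip is hypothetical, and accepting both IPv4 and IPv6 is an assumption; the library may accept only one family.

```python
import ipaddress


def looks_like_ip(host: str) -> bool:
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True
```

Host names such as "example.com" and out-of-range values such as "256.1.1.1" are rejected.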