# Quick Start Guide for Web2Vec

Web2Vec is a comprehensive library designed to convert websites into vector parameters. It provides ready-to-use implementations of web crawlers built on Scrapy, making it accessible to less experienced researchers. The tool is intended for website analysis tasks, including SEO, disinformation detection, and phishing identification.

## Installation

Install Web2Vec using pip:

```bash
pip install web2vec
```

## Configuration

Configure the library using environment variables or configuration files.

```shell
export WEB2VEC_CRAWLER_SPIDER_DEPTH_LIMIT=2
export WEB2VEC_DEFAULT_OUTPUT_PATH=/home/admin/crawler/output
export WEB2VEC_OPEN_PAGE_RANK_API_KEY=XXXXX
```

## Crawling websites and extracting parameters

```python
import os

from scrapy.crawler import CrawlerProcess

import web2vec as w2v

# Write crawled items to output.json in the configured output directory.
process = CrawlerProcess(
    settings={
        "FEEDS": {
            os.path.join(w2v.config.crawler_output_path, "output.json"): {
                "format": "json",
                "encoding": "utf8",
            }
        },
        "DEPTH_LIMIT": w2v.config.crawler_spider_depth_limit,
        "LOG_LEVEL": "INFO",
    }
)

process.crawl(
    w2v.Web2VecSpider,
    start_urls=["http://quotes.toscrape.com/"],  # pages to process
    allowed_domains=["quotes.toscrape.com"],  # domains to process for links
    extractors=w2v.ALL_EXTRACTORS,  # extractors to use
)
process.start()
```

The crawl produces output files with a structure similar to this sample:

```json
{
  "url": "http://quotes.toscrape.com/",
  "title": "Quotes to Scrape",
  "html": "\n\n\n\t\n\tQuotes to Scrape\n \n \n\n\n
\n
\n
\n

\n Quotes to Scrape\n

\n
\n
\n

\n \n Login\n \n

\n
\n
\n \n\n
\n
\n\n
\n \u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d\n by Albert Einstein\n (about)\n \n
\n Tags:\n \n \n change\n \n deep-thoughts\n \n thinking\n \n world\n \n
\n
\n\n
\n \u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d\n by J.K. Rowling\n (about)\n \n
\n Tags:\n \n \n abilities\n \n choices\n \n
\n
\n\n
\n \u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n by Albert Einstein\n (about)\n \n
\n Tags:\n \n \n inspirational\n \n life\n \n live\n \n miracle\n \n miracles\n \n
\n
\n\n
\n \u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d\n by Jane Austen\n (about)\n \n
\n Tags:\n \n \n aliteracy\n \n books\n \n classic\n \n humor\n \n
\n
\n\n
\n \u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d\n by Marilyn Monroe\n (about)\n \n
\n Tags:\n \n \n be-yourself\n \n inspirational\n \n
\n
\n\n
\n \u201cTry not to become a man of success. Rather become a man of value.\u201d\n by Albert Einstein\n (about)\n \n
\n Tags:\n \n \n adulthood\n \n success\n \n value\n \n
\n
\n\n
\n \u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d\n by Andr\u00e9 Gide\n (about)\n \n
\n Tags:\n \n \n life\n \n love\n \n
\n
\n\n
\n \u201cI have not failed. I've just found 10,000 ways that won't work.\u201d\n by Thomas A. Edison\n (about)\n \n
\n Tags:\n \n \n edison\n \n failure\n \n inspirational\n \n paraphrased\n \n
\n
\n\n
\n \u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d\n by Eleanor Roosevelt\n (about)\n \n
\n Tags:\n \n \n misattributed-eleanor-roosevelt\n \n
\n
\n\n
\n \u201cA day without sunshine is like, you know, night.\u201d\n by Steve Martin\n (about)\n \n
\n Tags:\n \n \n humor\n \n obvious\n \n simile\n \n
\n
\n\n \n
\n
\n \n

Top Ten tags

\n \n \n love\n \n \n \n inspirational\n \n \n \n life\n \n \n \n humor\n \n \n \n books\n \n \n \n reading\n \n \n \n friendship\n \n \n \n friends\n \n \n \n truth\n \n \n \n simile\n \n \n \n
\n
\n\n
\n \n\n",
  "response_headers": {
    "b'Content-Length'": "[b'11054']",
    "b'Date'": "[b'Tue, 23 Jul 2024 06:05:10 GMT']",
    "b'Content-Type'": "[b'text/html; charset=utf-8']"
  },
  "status_code": 200,
  "extractors": [
    {
      "name": "DNSFeatures",
      "result": {
        "domain": "quotes.toscrape.com",
        "records": [
          {
            "record_type": "A",
            "ttl": 225,
            "values": ["35.211.122.109"]
          },
          {
            "record_type": "CNAME",
            "ttl": 225,
            "values": ["ingress.prod-01.gcp.infra.zyte.group."]
          }
        ]
      }
    }
  ]
}
```

## Website analysis

Websites can also be analysed without running the scraping process, by calling extractors directly. For example, to get SimilarWeb data for a given domain, call the corresponding function:

```python
import web2vec as w2v

domain_to_check = "down.pcclear.com"
entry = w2v.get_similar_web_features(domain_to_check)
print(entry)
```

If you would like to try `Web2Vec` without installing it on your machine, consider using the preconfigured [Jupyter notebook](jupyter/web2vec.ipynb). Creating your own website dataset with Web2Vec is described in [this notebook](jupyter/web2vec_dataset_creation.ipynb). Training an ML model on a Web2Vec dataset is described in [this notebook](jupyter/web2vec_model_training.ipynb).

## Docker usage

If you want to use Web2Vec in a Docker container, see this [README](docker/README.md) file.
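
## Reading the crawl output

Once a crawl has finished, the records in the feed file can be post-processed with plain Python. The sketch below is not part of Web2Vec: `load_records`, `extractor_results`, and `dns_record_types` are hypothetical helper names, and it assumes each record has the shape shown in the sample output of the crawling section (an `extractors` list whose entries carry `name` and `result`). It also assumes the feed file holds either a JSON array of such records or a single record.

```python
import json
from pathlib import Path

# A trimmed record mirroring the sample output ("html" and headers omitted).
SAMPLE_RECORD = {
    "url": "http://quotes.toscrape.com/",
    "title": "Quotes to Scrape",
    "status_code": 200,
    "extractors": [
        {
            "name": "DNSFeatures",
            "result": {
                "domain": "quotes.toscrape.com",
                "records": [
                    {"record_type": "A", "ttl": 225, "values": ["35.211.122.109"]},
                    {
                        "record_type": "CNAME",
                        "ttl": 225,
                        "values": ["ingress.prod-01.gcp.infra.zyte.group."],
                    },
                ],
            },
        }
    ],
}


def load_records(path):
    """Load crawl records from a feed file (hypothetical helper).

    Accepts either a JSON array of records or a single JSON object.
    """
    data = json.loads(Path(path).read_text(encoding="utf8"))
    return data if isinstance(data, list) else [data]


def extractor_results(record):
    """Return a mapping of extractor name to its result for one page."""
    return {item["name"]: item["result"] for item in record.get("extractors", [])}


def dns_record_types(record):
    """List the DNS record types reported by the DNSFeatures extractor."""
    dns = extractor_results(record).get("DNSFeatures", {})
    return [entry["record_type"] for entry in dns.get("records", [])]


print(dns_record_types(SAMPLE_RECORD))  # prints ['A', 'CNAME']
```

For a real crawl, pass the path of the generated `output.json` to `load_records` and apply `extractor_results` to each returned record. Which extractor names appear depends on the extractors passed to the spider.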