# Quick Start Guide for Web2Vec
Web2Vec is a comprehensive library designed to convert websites into vector parameters. It provides ready-to-use implementations of web crawlers using Scrapy, making it accessible for less experienced researchers. This tool is invaluable for website analysis tasks, including SEO, disinformation detection, and phishing identification.
## Installation
Install Web2Vec using pip:
```bash
pip install web2vec
```
## Configuration
Configure the library using environment variables or configuration files.
```shell
export WEB2VEC_CRAWLER_SPIDER_DEPTH_LIMIT=2
export WEB2VEC_DEFAULT_OUTPUT_PATH=/home/admin/crawler/output
export WEB2VEC_OPEN_PAGE_RANK_API_KEY=XXXXX
```
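The same settings can also be applied from Python. The following is a minimal sketch, assuming Web2Vec reads these variables when it is imported (the guide does not confirm this); the values are the placeholders from the shell example.

```python
import os

# Same variables as in the shell example above; the values are placeholders.
settings = {
    "WEB2VEC_CRAWLER_SPIDER_DEPTH_LIMIT": "2",
    "WEB2VEC_DEFAULT_OUTPUT_PATH": "/home/admin/crawler/output",
    "WEB2VEC_OPEN_PAGE_RANK_API_KEY": "XXXXX",
}

# setdefault keeps any value that was already exported in the shell.
for name, value in settings.items():
    os.environ.setdefault(name, value)

# Assumption: import web2vec only after this point so that it sees the values.
# import web2vec as w2v
```

Using `setdefault` rather than plain assignment means an explicitly exported variable takes precedence over the defaults in the script.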
## Crawling websites and extracting parameters
```python
import os

from scrapy.crawler import CrawlerProcess

import web2vec as w2v

process = CrawlerProcess(
    settings={
        "FEEDS": {
            os.path.join(w2v.config.crawler_output_path, "output.json"): {
                "format": "json",
                "encoding": "utf8",
            }
        },
        "DEPTH_LIMIT": w2v.config.crawler_spider_depth_limit,
        "LOG_LEVEL": "INFO",
    }
)

process.crawl(
    w2v.Web2VecSpider,
    start_urls=["http://quotes.toscrape.com/"],  # pages to process
    allowed_domains=["quotes.toscrape.com"],  # domains to process for links
    extractors=w2v.ALL_EXTRACTORS,  # extractors to use
)
process.start()
```
Running the crawler writes output files with a structure similar to the following:
```json
{
  "url": "http://quotes.toscrape.com/",
  "title": "Quotes to Scrape",
  "html": "\n\n\n \u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d\n by Albert Einstein\n (about)\n \n\n \u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n by Albert Einstein\n (about)\n \n\n \u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d\n by Marilyn Monroe\n (about)\n \n\n \n\n",
  "response_headers": {
    "b'Content-Length'": "[b'11054']",
    "b'Date'": "[b'Tue, 23 Jul 2024 06:05:10 GMT']",
    "b'Content-Type'": "[b'text/html; charset=utf-8']"
  },
  "status_code": 200,
  "extractors": [
    {
      "name": "DNSFeatures",
      "result": {
        "domain": "quotes.toscrape.com",
        "records": [
          {
            "record_type": "A",
            "ttl": 225,
            "values": [
              "35.211.122.109"
            ]
          },
          {
            "record_type": "CNAME",
            "ttl": 225,
            "values": [
              "ingress.prod-01.gcp.infra.zyte.group."
            ]
          }
        ]
      }
    }
  ]
}
```
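Extractor results can be read back from such a file with plain `json`. This is a minimal sketch, assuming the feed file is a JSON array of items shaped like the sample above (Scrapy's `json` feed format writes a list); `load_items` and `extractor_results` are hypothetical helpers, not part of Web2Vec.

```python
import json


def load_items(path):
    """Load all crawled items from a Scrapy JSON feed file."""
    with open(path, encoding="utf8") as handle:
        return json.load(handle)


def extractor_results(item):
    """Map extractor name -> result dict for one crawled item."""
    return {entry["name"]: entry["result"] for entry in item.get("extractors", [])}


# Trimmed copy of the sample item above, used here instead of a real file.
sample_item = {
    "url": "http://quotes.toscrape.com/",
    "status_code": 200,
    "extractors": [
        {
            "name": "DNSFeatures",
            "result": {
                "domain": "quotes.toscrape.com",
                "records": [
                    {"record_type": "A", "ttl": 225, "values": ["35.211.122.109"]},
                ],
            },
        }
    ],
}

dns = extractor_results(sample_item)["DNSFeatures"]
a_records = [r["values"] for r in dns["records"] if r["record_type"] == "A"]
print(dns["domain"], a_records)
```

With a real crawl, `load_items` would be called on the `output.json` path configured in `FEEDS`, and `extractor_results` applied to each item in the returned list.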
## Website analysis
Websites can also be analysed without running the crawler, by calling the extractors directly. For example, to get SimilarWeb data for a given domain, call the corresponding function:
```python
import web2vec as w2v
domain_to_check = "down.pcclear.com"
entry = w2v.get_similar_web_features(domain_to_check)
print(entry)
```
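When checking several domains, it can help to keep failures separate from successful results. The helper below is a hypothetical sketch, not part of Web2Vec: it takes the extractor function as an argument, so it makes no assumption about which exceptions an extractor raises or what it returns.

```python
def collect_features(domains, fetch):
    """Call fetch(domain) for each domain; return (results, errors)."""
    results, errors = {}, {}
    for domain in domains:
        try:
            results[domain] = fetch(domain)
        except Exception as exc:  # broad on purpose: failure types are unknown here
            errors[domain] = repr(exc)
    return results, errors


# Intended use (needs Web2Vec installed and network access):
# import web2vec as w2v
# results, errors = collect_features(
#     ["down.pcclear.com"], w2v.get_similar_web_features
# )
```

Passing the extractor in as `fetch` also makes the loop easy to try with a stub function before hitting any external service.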
If you would like to try Web2Vec without installing it on your machine, consider using the preconfigured [Jupyter notebook](jupyter/web2vec.ipynb).
Creating your own website dataset with Web2Vec is described in [this notebook](jupyter/web2vec_dataset_creation.ipynb).
Training an ML model on a Web2Vec dataset is described in [this notebook](jupyter/web2vec_model_training.ipynb).
## Docker usage
If you want to use Web2Vec in a Docker container, please check this [README](docker/README.md) file.