In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within MindsDB, a web crawler can be employed to harvest data that can then be used to train models, build domain-specific chatbots, or fine-tune LLMs.
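
For instance, once crawled content is available (the connection and crawling queries are shown in the sections below), it can be joined with a model to build a simple question-answering flow over website content. The following is only an illustrative sketch: qa_model is a hypothetical model name, it assumes the OpenAI engine is set up in your MindsDB instance, and it assumes the crawler exposes the page text in a text_content column.

-- Hypothetical model; assumes the OpenAI integration is available
-- and that an API key is supplied (placeholder below).
CREATE MODEL qa_model
PREDICT answer
USING
    engine = 'openai',
    openai_api_key = '<your-openai-api-key>',
    prompt_template = 'Summarize what MindsDB is, based on this page content: {{text_content}}';

-- Join crawled pages with the model so each crawled page
-- produces one answer grounded in its text.
SELECT t.url, m.answer
FROM my_web.crawler AS t
JOIN qa_model AS m
WHERE t.url = 'docs.mindsdb.com'
LIMIT 1;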

Prerequisites

Before proceeding, ensure the following prerequisites are met:

  1. Install MindsDB locally via Docker or use MindsDB Cloud.
  2. To connect the web crawler to MindsDB, install the required dependencies by following these instructions.
  3. Install or ensure access to Web Crawler.

Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

CREATE DATABASE my_web 
WITH ENGINE = 'web';
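
Once created, my_web should appear alongside your other connected data sources. One quick way to verify the connection is:

-- lists all connected data sources, including my_web
SHOW DATABASES;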

Usage

Get Website Content

Here is how to get the content of docs.mindsdb.com:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 1;

You can also get the content of internal pages. Here is how to fetch content from up to 10 pages of docs.mindsdb.com, including pages linked from the starting URL:

SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 10;

Another option is to fetch content from multiple websites at once:

SELECT * 
FROM my_web.crawler 
WHERE url IN ('docs.mindsdb.com', 'docs.python.org') 
LIMIT 1;

Get PDF Content

MindsDB accepts file uploads in the csv, xlsx, xls, sheet, json, and parquet formats. However, you can use the web crawler to fetch data from PDF files:

SELECT * 
FROM my_web.crawler 
WHERE url = '<link-to-pdf-file>' 
LIMIT 1;

For example, you can provide a link to a PDF file stored in Amazon S3.
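
A sketch of such a query is shown below; the bucket name and file path are placeholders rather than a real object, and the PDF must be publicly accessible at that URL:

-- The URL below is a placeholder for a publicly accessible PDF stored in S3.
SELECT *
FROM my_web.crawler
WHERE url = 'https://my-bucket.s3.amazonaws.com/documents/report.pdf'
LIMIT 1;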