We can use ClickHouse online service with full data access to make ClikcHouse sample dataset and use OpenDigger to explore the data.
docker pull --platform linux/amd64 registry.cn-beijing.aliyuncs.com/open-digger/open-digger-clickhouse-base:v2
docker pull --platform linux/arm64 registry.cn-beijing.aliyuncs.com/open-digger/open-digger-clickhouse-base:v2
To use sample data from OSS service, you need to follow the steps:
Download data from OSS. We provide several sample datasets in the table below.
Extract data from archive file to a folder: tar -zxvf data.tar.gz -C ./folder_name
. You will get a table
and a data
file.
Use ClickHouse base image with extracted data to initialize the database. The extracted data
and table
file should be mounted into /data/
folder into the container, here is an example:
docker run -d --name container_name -m 8G -p 8123:8123 -p 9000:9000 --ulimit nofile=262144:262144 --volume=$(pwd)/folder_name/:/data/ registry.cn-beijing.aliyuncs.com/open-digger/open-digger-clickhouse-base:v2
;docker run -d --name container_name -m 8G -p 8123:8123 -p 9000:9000 --ulimit nofile=262144:262144 --volume=%cd%/folder_name/:/data/ registry.cn-beijing.aliyuncs.com/open-digger/open-digger-clickhouse-base:v2
.In the above command lines, $(pwd)
or %cd%
makes sure the host-src
be an absolute path.
Notice: As referred in Docker’s Doc, the
host-src
in--volume=[host-src:]container-dest[:<options>]
must be an absolute path or a name value.
If you supply an absolute path for the
host-src
, Docker bind-mounts to the path you specify.If you supply a name, Docker creates a named volume by that name.
A name value must start with an alphanumeric character, followed by
a-z0-9
,_
(underscore),.
(period) or-
(hyphen). An absolute path starts with a/
(forward slash).
Insert data done.
logged into container console. Now the Clickhouse container is running. Stop and restart the same container instance will not import data again.To use the sample data, at minimum 8 GB memory should be allocated to the container instance.
Data | Description | SQL | Record counts | Uncompressed size | Compressed size | Imported size(est.) | Import time(est.) |
---|---|---|---|---|---|---|---|
2020_full | All records from year 2020 | sql_files/2020_full.sql | 855 million | 802 GB | 81 GB | 121 GB | 7 h |
2015_2021_top_50_year | Top 50 most active repos from year 2015 to 2021 for every year | sql_files/2015_2021_top50_year.sql | 168 million | 117 GB | 8.4 GB | 13 GB | 50 m |
second_sample | All events log sample by 1 second in a hour | sql_files/second_sample.sql | 62 million | 57 GB | 10 GB | 14 GB | 25 m |
label_2015 | All events log for labeled repo in OpenDigger in 2015 | sql_files/label_2015.sql | 3.5 million | 2.9 GB | 378 MB | 552 MB | 3 m |
paddle_hackathon_3 | Data under PaddlePaddle org for Hackathon | sql_files/paddle_hackathon_3.sql | 1.38 million | 1.4 GB | 184 MB | 222 MB | 1 m |
kernel_name | kernel_src_dir | kernel_requirements | kernel_readme | kernel_build_run | kernel_base_img |
---|---|---|---|---|---|
node.js | ”./src” | ”./package.json”->”dependencies” | ”./sample_data/README.md” | ”./package.json” | ”./package.json”->”scripts”.”notebook”->”docker pull {kernel_base_img}” |
python | ”./python” | ”./python/requirements.txt” | ”./python/README.md” | docker command | “continuumio/miniconda3” |
python_v2 | ”./python_v2” | ”./python_v2/requirements.txt” | ”./python_v2/README.md” | docker command | “continuumio/miniconda3” |
pycjs | ”./pycjs” | ”./pycjs/requirements.txt” | ”./pycjs/README.md” | ”./package.json” | ”./package.json”->”scripts”.”notebook”->”docker pull {kernel_base_img}” |
The kernel node.js only depends on the JavaScript envronment, remains up-to-date.
The kernel python and kernel python_v2 only depend on the Python envronment, stopped updating. The python_v2 is more updated.
The kernel pycjs depend on the JavaScript envronment and a node_vm2 Python package, always automatically updates synchronously with the kernel node.js, which is a Python interface where the values of variables are retrieved from node.js sandbox created by the VM.
Recommended kernels: kernel node.js for TypeScript/JavaScript language, kernel pycjs for python language. Warning: Make sure the implements of metrics are consistent with its definition when using the kernel python or the kernel python_v2.
Start your ClickHouse container, which should be set up in the last step. Now:
Clone OpenDigger git clone https://github.com/X-lab2017/open-digger.git
Enter the repo path cd open-digger
Install the necessary packages npm install
Go to the src
folder in the open-digger root directory, create a file named ‘local_config.ts’ with the following contents:
export default {
db: {
clickhouse: {
host: '172.17.0.1'
}
}
}
Use npm run notebook
to use Notebook image if you use Linux/MacOS system, or to use npm run notebook:win
if you use Windows system.
Open the link in console log like http://127.0.0.1:8888/lab?token=xxxxx
.
If the source code under src
folder changed, you need to use npm run build
and restart the notebook kernel to reload the sorce code.
You can find the notebook folder, where we provide demos in the handbook. You can create a new file, and happy data exploring!
The format $${}
represents the values of a chosen Python kernel in the Kernels table.
Start your ClickHouse container, which should be set up in the last step. Now:
Clone OpenDigger git clone https://github.com/X-lab2017/open-digger.git
Enter the repo path cd open-digger
*If use the kernel pycjs: Install the necessary packages npm install
.
Go to the $${kernel_src_dir}
folder in the open-digger root directory, create a file named ‘local_config.py’ for Python Kernel with the following contents:
local_config = {
'db': {
'clickhouse': {
'host':'172.17.0.1',
'user':'default'
},
'neo4j':{
'port': '7687',
}
}
}
the host
above is the host of the ClickHouse server. We can find it using docker inspect container_name
(the container_name is set by command docker run –name xxx), and copy the Gateway
like this:
$ docker inspect container_name | grep Gateway
"Gateway": "172.17.0.1",
"IPv6Gateway": "",
"Gateway": "172.17.0.1",
"IPv6Gateway": "",
Return the repo path cd open-digger
.
If use the kernel pycjs: Build ts npm run build
. Since the npm run build command is important to active every settings change, the kernel pycjs supports npm run notebook-pycjs
to execute the *npm run build, docker build and docker run command automatically, instead of manually executing them step by step as below.
Use docker build --build-arg KER_REL_PATH='$${kernel_src_dir}' --build-arg BASE_IMAGE='$${kernel_base_img}' -t opendigger-jupyter-python:1.0 $(pwd)
to make a docker image. The base python image is based on miniconda
. You can check the Dockerfile
in root directory.
If you are using Windows CMD, all the
$(pwd)
here should be replaced by%cd%
. And if you are using Windows Powershell, all the$(pwd)
here should be replaced by${pwd}
.Notice: Pathnames of directories like “pwd” may use
\
to join the directory in some versions of Windows. We recommend using absolute paths.
Then we can use docker run -i -t --name python_notebook_name --rm -p 8888:8888 -v "$(pwd):/python_kernel/notebook" opendigger-jupyter-python:1.0
to create and run the container.
Open the link in console log like http://127.0.0.1:8888/lab?token=xxxxx
.
If the source code under src
folder changed, you need to stop the notebook docker using docker stop python_notebook_name
and restart the notebook kernel using docker run -i -t --name python_notebook_name --rm -p 8888:8888 -v "$(pwd):/python_kernel/notebook" opendigger-jupyter-python:1.0
to reload the sorce code.
You can find the notebook folder, where we provide demos in the handbook. You can create a new file, and happy data exploring!
The file export_sample.sh
is used to export sample data from remote ClickHouse server.
You can pass in two parameters, the first one is a file with SQL to export the data you need, the second one is a tag name used as a path param to upload to OSS.
Environment variables need to be set before the shell script run.
Run CH_SERVER=localhost CH_PORT=8123 CH_USER=amdin CH_PASSWORD=amdin ./export_sample.sh ./sql_files/2020_full.sql 2020_full
to export data from local ClickHouse server, the sample data will save into data
folder and the data files(data.tar.gz which contains data and table schema) will upload to OSS.
System prerequisite: clickhouse
CLI tool and ossutil
CLI tool.
The files under build
is used to make base images to load the data made by export_sample.sh
.
Dockerfile
is used to build the images from ClickHouse official base image for amd64
and arm64
platform. Please make sure that buildx
and buildkit
is properly setup in your environment.
initdb.sh
script is used to initialize database from static dataset.