End-to-End Data Engineer DATA LAKE Project (Scala Spark 3.5.1)
Briefly

Airflow can trigger our scripts on a schedule. In real life, organizations and companies work with millions or billions of records, so you want to start these jobs when the company is not working. Think of it like this: our bronze layer is the raw data, and we need to append that raw data to HDFS; but the servers are used in shifts, so the ingestion has to run outside shift hours.
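To make that concrete, below is a minimal sketch of the kind of bronze-layer ingestion job Airflow would kick off; the object name, source format, and HDFS paths are placeholders made up for illustration, not the project's actual ones.

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical bronze-layer ingestion job; names and paths are illustrative only.
object BronzeIngestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bronze-ingestion")
      .getOrCreate()

    // Read the raw source files as-is; the bronze layer keeps data unmodified.
    val raw = spark.read
      .option("header", "true")
      .csv("hdfs://namenode:9000/landing/sales/")

    // Append to the bronze zone on HDFS, so each scheduled run adds the new raw data.
    raw.write
      .mode(SaveMode.Append)
      .parquet("hdfs://namenode:9000/bronze/sales/")

    spark.stop()
  }
}

In a setup like this, Airflow would launch the class with spark-submit on the Spark master outside shift hours, which is exactly why the two containers need the SSH access described next.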
If we want to trigger our script from the Airflow container, we need to set up SSH access into the Spark master container. You can think of a container like a server in real life, so what we are really doing is opening an SSH connection between our two servers.
# YOU CAN FIND THE FULL DOCKER FILES IN THE sparkMaster FOLDER.
# SSH configuration
RUN mkdir -p /var/run/sshd
# THE ROOT PASSWORD FOR SPARK-MASTER; THIS WILL BE THE SSH PASSWORD
RUN echo 'root:screencast' | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
# Generate SSH keys
RUN ssh-keygen -t rsa -f /root/.ssh/id_rsa -q -N ""
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
Read at Medium