Python
I have been using pyenv+venv to build Python environments. There are several package management tools for Python, but the landscape is chaotic and I had not touched them until now. In the midst of all this, I had a chance to use Poetry, so I decid…
Since I have been experimenting with Selenium recently, I thought I would make a note of how to use it so that I can look it up again if I forget. Install Selenium: pip install selenium. The browser and the WebDriver that …
I use Puppeteer for web crawling. I had hoped to someday put together a guide on how to use Puppeteer, but in the meantime, I have been hearing a lot about Playwright as a browser manipulation tool similar to Puppeteer. Then I started thin…
Since a Spark job is just a Python program, it can be written quite freely. However, since I usually have a general idea of what I need to do, and knowing many different ways of writing Spark makes them harder to remember, I will summarize my personal frequent…
I have tried AWS Glue before, and here are my thoughts on it. AWS Glue is built on Apache Spark, and it was interesting to touch Spark for the first time back then. I was thinking that I would put together a usage guide for Glue wh…
How to access AWS S3 from Spark (Google Dataproc). Procedure: Spark configuration. The following Spark and Hadoop settings will allow you to read and write AWS S3 files from Spark. Load the following AWS-related jar files into Spark: aws-java…
How to access Microsoft SQL Server (Azure Database) from Spark (Google Dataproc). Procedure: Spark configuration. The following Spark settings will allow you to read and write SQL Server data from Spark. Download the MS SQL Server JDBC jar f…
How to access MySQL from Spark (Google Dataproc). Since access goes through JDBC, the same approach can be applied to other RDBs such as PostgreSQL. Procedure: Spark configuration. The following Spark settings will allow you to read and write MySQL data f…
BigQuery can handle huge amounts of data, yet you don't have to worry about the infrastructure at all (really, at all), and it's fast and cheap. It is tempting to put all your data into BigQuery and process it all with BigQuery. That's why…
I wanted to get a list of Google Cloud Storage subdirectories using the GCP Python library and got stuck, so here's a note on how and why. Libraries in other languages are API wrappers just like the Python library, so I think they can be app…
Tableau not only displays graphs, but also provides cluster analysis capabilities. Python is free, and it is easy enough to perform cluster analysis with Python, but when trying to perform cluster analysis in Python, work such as standardi…
Since the version of Ubuntu in WSL (Windows Subsystem for Linux) was getting old at 16, I installed a new Ubuntu, version 18. I also set up Python3 + Jupyter Notebook again accordingly, but I ran into a few snags along the way, so here a…