Getting Started with Databricks
Databricks is a company founded in 2013 by the original creators of Apache Spark, and its platform carries the same name. The company aims to provide a unified data analytics platform: a secure, collaborative environment that is easy to configure and administer, built on a highly scalable Spark environment. Databricks supports Python, R, Scala, and SQL, along with a full range of the most widely used libraries for these languages. Beyond this, its most remarkable characteristic is how easily it lets users set up and start using a Spark environment in a few minutes.
Creating a Databricks Community Edition Account
As is common with most of the student or community editions of applications or web sites, you’ll need only a valid email to register to use the community edition.
To create the account, go to https://community.cloud.databricks.com/login.html
During account creation, you’ll be asked to select the version of the platform that you want to use:
Databricks Platform – Free Trial or Community Edition
With the free trial, you have access to the full benefits of the environment and the full list of cluster specifications to scale and use. However, as a trial version, it lasts only 14 days. In addition, you need to use the 14-day trial on one of the Databricks partners: Microsoft Azure or AWS.
With the Community Edition, your environment is more limited. According to Databricks, you get:
- Single cluster limited to 6GB and no worker nodes
- Basic notebooks without collaboration
- Limited to 3 max users
- Public environment to share your work
At first glance this looks like a low-end configuration, but it lets you do almost everything you can do in the 14-day trial, without losing your data and code. Although the sign-up page says that you get a single cluster, we were asked to use Databricks as the environment for my MBA end-of-course work, and we noticed that the cluster starts with a single node but, during heavy processing, another node appeared to start automatically to act as a worker.
Another interesting characteristic of Databricks Community Edition, which I personally see as an advantage over the trial, is that it offers an agnostic way to learn and use PySpark: the Databricks interface in this native view is the same as that of both cloud partners, Azure and AWS. So, if you use it to learn, study, or develop a POC, all the knowledge you gain carries over to the partner platforms, and your code and data will work in Azure and AWS environments.
A Workspace in Databricks is the environment that you use to create and organize your objects. These objects consist of notebooks, libraries, and MLflow experiments, and you can create folders to organize them.
Databricks provides 3 special folders: Workspace itself, Shared, and Users.
- Workspace Folder – the root folder, which holds all the objects and assets of the organization when you use a business/organization version, or all your personal objects if you’re using the community version
- Shared Folder – In a business/organization version, the Shared folder is used for sharing objects across the organization.
- Users – This folder contains a folder for each user in a business/organization version, or your personal folder in a community version, and fol
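To make the folder layout more concrete, here is a minimal sketch of how paths under the special folders could be built and classified. It assumes that workspace paths start with /Shared and /Users followed by the user's email; the helper names and the example email are hypothetical and are not part of any Databricks API.

```python
import posixpath

# Assumed top-level names of the special folders in the workspace tree.
SHARED_ROOT = "/Shared"
USERS_ROOT = "/Users"


def user_path(user_email, *parts):
    """Build a path inside a user's personal folder (hypothetical helper)."""
    return posixpath.join(USERS_ROOT, user_email, *parts)


def shared_path(*parts):
    """Build a path inside the Shared folder (hypothetical helper)."""
    return posixpath.join(SHARED_ROOT, *parts)


def classify(path):
    """Tell which special folder a workspace path belongs to."""
    normalized = posixpath.normpath(path)
    for label, root in (("shared", SHARED_ROOT), ("users", USERS_ROOT)):
        if normalized == root or normalized.startswith(root + "/"):
            return label
    # Anything else sits directly under the root Workspace folder.
    return "workspace"


notebook = user_path("me@example.com", "poc", "first_notebook")
print(notebook, "->", classify(notebook))
print(shared_path("reports"), "->", classify(shared_path("reports")))
```

In the community edition only the personal folder really matters, but keeping the same layout in mind makes it easier to move a project to an organization account later.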
Creating and Starting a Cluster
As I said before, the community edition gives you limited, but still quite usable, access to a single-node Spark cluster and to DBFS (Databricks File System, a file system similar to HDFS in Hadoop). To use it, you need to create and start the cluster on the “Clusters” page, which you reach through the “Clusters” option in the left bar.
Clicking “+ Create Cluster” opens the “New Cluster” page, where you can select the Databricks Runtime Version (the image used to create the cluster machine) and one of 3 available zones for the cluster. The page also shows the limits of Databricks Community Edition: 15 GB of memory, a 2-core machine, and automatic termination of the cluster after 2 hours of inactivity.
The default Databricks Runtime Version for the community edition is 6.5, with Scala 2.11, Spark 2.4.5, and Python 3 support.
With this version, you can use the most common Python libraries, like pandas and scikit-learn (remember: pandas DataFrames run on a single machine and do not perform well in a distributed environment). To take full advantage of Spark, you can use PySpark, MLlib, Delta Lake and Koalas 1.0, recently released at Spark+AI Summit 2020. This first version implements the most commonly used pandas APIs, covering about 80% of all the pandas APIs. This gives pandas users an easy transition path to Apache Spark, since their code is about 80% compatible and does not need to be converted to PySpark to use distributed processing.
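As a small illustration of that transition path, the sketch below runs a plain pandas aggregation and shows, in comments, how the same call would look with Koalas. The sample data is made up for the example, and the Koalas lines are left as comments because they assume a cluster where Koalas 1.0 is installed.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame(
    {
        "city": ["A", "A", "B", "B"],
        "temperature": [20.0, 22.0, 30.0, 34.0],
    }
)

# Plain pandas: everything runs on a single machine.
mean_by_city = pdf.groupby("city")["temperature"].mean()
print(mean_by_city)

# With Koalas on a cluster (assumption: Koalas 1.0 is available there),
# the same groupby/mean call is expected to run on Spark instead:
#   import databricks.koalas as ks
#   kdf = ks.from_pandas(pdf)
#   kdf.groupby("city")["temperature"].mean()
```

The point is that only the import and the DataFrame constructor change, while the analysis code keeps its pandas shape.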
In the “Create Notebook” window, we set the name of the notebook and choose the “Default Language”. Spark supports Python, Scala, SQL, and R, and even if you select one as the default, you can change the language context within the notebook.
The notebook editor is very similar to Jupyter Notebook, a commonly used interface for working with Python. The shortcuts are familiar: use Shift+Enter to run a command and insert a new command cell, use Ctrl+Enter to run a command without inserting a new cell, and so on for other Jupyter Notebook shortcuts.
Here you will find the cluster that you created and to which the notebook is attached; clicking it shows the cluster options directly here:
The File menu offers several options. One of the interesting ones is Publish, which makes your notebook public, so that anyone with the link can view your code and results (e.g. https://tinyurl.com/ybuoew3u, the notebook I created just for this article). The Export/DBC Archive option uses an internal Databricks format that lets you export a single notebook or an entire folder with numerous notebooks. It is a JAR file with some extra metadata used internally by Databricks. It’s very useful when you need to move an entire project to another account, or for backup. Speaking of backup, another interesting tool is “Revision History”. Like a version history repository, it lets you return your code to an earlier version.
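Since a JAR file is a zip-compatible archive, one way to peek inside an exported DBC archive is Python's standard zipfile module. The sketch below is only an assumption-based illustration: it builds a small stand-in archive in memory so that it runs anywhere, and the entry names are hypothetical and do not describe the real internal layout of a DBC file.

```python
import io
import zipfile


def list_archive_entries(archive):
    """Return the sorted entry names of a zip-compatible archive.

    `archive` can be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(zf.namelist())


# Stand-in archive built in memory (hypothetical entry names),
# so the sketch does not need a real export to run.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("project/notebook_a.python", "{}")
    zf.writestr("project/notebook_b.python", "{}")
buffer.seek(0)

entries = list_archive_entries(buffer)
print(entries)
```

With a real export, you would pass the path of the downloaded .dbc file to list_archive_entries instead of the in-memory buffer.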
The command cells themselves have some interesting options too. They offer an easy way to show the data returned by the display() command graphically. Consider the data below:
If I want a fast and simple graphical view of the temperatures, I don’t need to write any code, because the button does this for us. After clicking the button, we see this:
The graphical view is customizable, and you have a few helpful graph options:
By clicking “Plot Options…”, we can configure our chart:
option. When Comments is active, the button appears, and the dev can write some
If your notebook was shared with other users for editing, the comments allow the other users to insert a reply to your comment, turning it into a useful tool for collaborative work.