Technology:Databricks
== Background ==
Friday, Sept 27th: decided to learn what Databricks is.
== Setup ==
https://docs.databricks.com/en/getting-started/onboarding-account.html
[[File:db-infinite-json-files.png|200px|thumb|right|Upload files to volume]]
[[File:db-notebook.png|200px|thumb|right|Processing with notebook]]
# Create stack - done
# Create compute resource - done / started serverless starter warehouse
# Connect workspace to data sources
#* s3://.../374226360171826
# Added volume/directory
#* test20/default/inventory/all-stock
#* 78 files I think, each about 65 MB of JSON
# Created notebook
#* Now attempting to read the files; perhaps I loaded too many - this is taking a while for one node to "set up"
#* Referenced the volume properly; now waiting for the notebook to process the records
#* Reference: https://www.databricks.com/glossary/pyspark
#* Spent 20 minutes fighting what Spark thought was corrupt JSON; it turned out the files were multiline, and this worked: `df = spark.read.option("multiline", "true").json(file_path)`
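The multiline fix makes sense given how Spark parses JSON: by default, `spark.read.json` expects JSON Lines input - one complete object per physical line - and anything else gets flagged as a `_corrupt_record`. A minimal stdlib sketch of why a pretty-printed file trips up a line-by-line parser (the `sku`/`qty` record is a made-up example, not the actual inventory schema):

```python
import json

# A sample record (hypothetical; stands in for one inventory item).
record = {"sku": "A-1", "qty": 3}

# JSON Lines: the whole object on one physical line.
jsonl_text = json.dumps(record)
# Pretty-printed ("multiline") JSON: the same object spans several lines.
multiline_text = json.dumps(record, indent=2)

# Parsing line-by-line (what Spark does without the multiline option)
# works fine for JSON Lines...
assert json.loads(jsonl_text.splitlines()[0]) == record

# ...but the first line of a pretty-printed file is just "{",
# which is not valid JSON on its own - hence the "corrupt" records.
try:
    json.loads(multiline_text.splitlines()[0])
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
assert parsed_ok is False
```

Setting `option("multiline", "true")` tells Spark to parse each file as a whole document instead of line by line, at the cost of less parallelism within a file.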
Latest revision as of 22:48, 27 September 2024