Technology:Databricks
== Background ==
Friday, Sept 27th: decided to learn what Databricks is.
== Setup ==
https://docs.databricks.com/en/getting-started/onboarding-account.html
[[File:db-infinite-json-files.png|200px|thumb|right|Upload files to volume]]
[[File:db-notebook.png|200px|thumb|right|Processing with notebook]]
# Create stack - done
# Create compute resource - done / started serverless starter warehouse
# Connect workspace to data sources
#* s3://.../374226360171826
# Added volume/directory
#* test20/default/inventory/all-stock
#* 78 files I think, each about 65 MB of JSON
# Created notebook
#* Now attempting to read the files; perhaps I loaded too many - this is taking a while for one node to "set up"
#* Referenced the volume properly; now waiting for the notebook to process the records
#* Reference: https://www.databricks.com/glossary/pyspark
#* Spent 20 minutes fighting what Spark thought was corrupt JSON; it turned out the files were multiline, and this worked: `df = spark.read.option("multiline", "true").json(file_path)`
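The multiline fix makes sense given how Spark parses JSON: by default, `spark.read.json` expects JSON Lines input - one complete object per physical line - and anything else gets flagged as a `_corrupt_record`. A minimal stdlib sketch of why a pretty-printed file trips up a line-by-line parser (the `sku`/`qty` record is a made-up example, not the actual inventory schema):

```python
import json

# A sample record (hypothetical; stands in for one inventory item).
record = {"sku": "A-1", "qty": 3}

# JSON Lines: the whole object on one physical line.
jsonl_text = json.dumps(record)
# Pretty-printed ("multiline") JSON: the same object spans several lines.
multiline_text = json.dumps(record, indent=2)

# Parsing line-by-line (what Spark does without the multiline option)
# works fine for JSON Lines...
assert json.loads(jsonl_text.splitlines()[0]) == record

# ...but the first line of a pretty-printed file is just "{",
# which is not valid JSON on its own - hence the "corrupt" records.
try:
    json.loads(multiline_text.splitlines()[0])
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
assert parsed_ok is False
```

Setting `option("multiline", "true")` tells Spark to parse each file as a whole document instead of line by line, at the cost of less parallelism within a file.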
Latest revision as of 22:48, 27 September 2024