Technology:Databricks
== Background ==
Friday, Sept 27th: decided to learn what Databricks is.
== Setup ==
https://docs.databricks.com/en/getting-started/onboarding-account.html
 
[[File:db-infinite-json-files.png|200px|thumb|right|Upload files to volume]]
[[File:db-notebook.png|200px|thumb|right|processing with notebook]]
# create stack - done
# create compute resource - done / started serverless starter warehouse
# connect workspace to data sources
#* s3://.../374226360171826
# added volume/directory
#* test20/default/inventory/all-stock
#* 78 files I think, each about 65MB JSON
# Created Notebook
#* now I'm attempting to read the files; perhaps I loaded too many, since this is taking a while for 1 node to "set up"
#* referenced the volume properly, now waiting for the notebook to process the records
#* reference https://www.databricks.com/glossary/pyspark
#* spent 20 minutes fighting what Spark thought was corrupt JSON; turns out this worked: `df = spark.read.option("multiline", "true").json(file_path)`

Latest revision as of 22:48, 27 September 2024