See below for frequently asked questions and answers for ND027: Data Engineer.
Note: Links are only available for actively enrolled learners in a Nanodegree Program or single paid course.
PROJECT 1: Data Modeling with Postgres
- song_id and artist_id in songplays table are NULL
- How to download the dataset for the project?
- How to run create_tables.py?
- How to download the dataset from your workspace?
- How to develop locally with a Docker image?
PROJECT 2: Data Modeling with Apache Cassandra
- Queries 2 and 3 throw the error: Row object has no attribute 'firstName'.
- Do the checkpoint files in event_data/.ipynb_checkpoints/2018-11-01-events-checkpoint.csv need to be there?
- Why is the order of columns in column_definition of CREATE TABLE important? (Cassandra)
- Error in clustering columns?
- No data returned on running first query. Help!
- Why doesn't Cassandra return all rows for a query if we don't have a unique primary key?
- A partition key with 2 clustering columns or a composite partition key with one clustering column?
PROJECT 3: Data Warehouse
- Issues connecting to Redshift cluster in Lesson 3 Exercise 2.
- Cluster available but not visible in AWS
- Which staging storage will be used? S3 or Redshift?
PROJECT 4: Data Lakes
- `Operation timed out` on the ssh connection
- Create EMR cluster using CLI
- Why is the songplays table partitioned by year and month, and how?
- How long does it take to write into S3?
- How to run ETL.py?
PROJECT 5: Data Pipelines with Airflow
- Cannot access udac-data-pipelines
- Should there be an hourly schedule of staging?
- load_songplays_table task fails in the DAG in the final project
- time table schema missing in Data Pipelines table?
- General questions about Airflow
PROJECT 6: Data Capstone
- Sample data architecture and database schema
- How to read multiple .sas7bdat files into a data frame?
- Error: Vpc associated with db subnet group does not exist
- Where do I get datasets from?