谷歌云数据工程师考试 - BigQuery复习笔记

2018-07-14  本文已影响0人  塞小娜

这周一直在准备谷歌云数据工程师的考试 (Google Cloud Data Engineer Certification),就把一篇学习笔记贴上来。

时间仓促暂时没法翻译成中文,希望看官们谅解。

BigQuery Summary

What is?
Google's fully managed, petabyte scale, low cost analytics data warehouse.

1.png

Serverless?
Yes

Benefits

Allows near real-time analysis
-> not totally real-time
-> if need milliseconds delay in transactional database, BigQuery is not the answer; the answer will be Cloud SQL or Spanner
-> but it is a good choice for ad-hoc analysis of very large datasets
-> data warehousing and business intelligence

Completely no-ops
-> no need to create cluster before querying
-> pay for amount of data processed

Pricing
-> pay for use
OR
-> flat-rate pricing

Durable
-> if you put data in BigQuery, it is just like GCS

Immutable audit logs
-> audit logs cannot be tampered with
-> you know if someone actually uses the dataset

A way to share data beyond silos of your company’s structure
-> cluster less, so can take BigQuery and share it with anybody under your company’s domain
-> anybody that you allow to access your table is able to access your table without any ops that is typically required to dealt with
-> no need to worry that “how do I log in the person’s cluster so I can use their data” it is cluster free!
-> BigQuery can become the form by which you get data collaboration across your company -> way of sharing analysis, and that is live!

How to use?

How to load data?

You can load data:

Not loaded but can query external data sources
BigQuery offers support for querying data directly from:

BigQuery supports loading data from Cloud Storage and readable data sources in the following formats:

You can load data into a new table or partition, you can append data to an existing table or partition, or you can overwrite a table or partition. For more information on working with partitions, see Managing Partitioned Tables.

When you load data into BigQuery, you can supply the table or partition schema, or for supported data formats, you can use schema auto-detection.

How to load schema?

Indexing

BigQuery does not use or support indexes.

Cost Reduction best practice

Performance Improve best practice

Billing

3.png

Free
-> loading
-> exporting
-> queries on metadata
-> cached queries: if you run a query and you run exactly the same query in that project, it’s free
-> per user cache
-> if two users on the same project, they don’t share the cache
-> query with errors

Storage
-> You get charged based on the amount of data in the table
-> Streaming data
-> get charged based on the ingest rate of that streaming data
-> automatically get a discount for older data

Processing
-> on-demand: charge for queries based on the amount of data that’s being processed, 1TB / month free
-> flat rate plan: talk with sales team
-> opt-in to run high-compute queries: JS UDF

总结就是:
data input: streaming insert收钱,其他不收
data storage: 收钱
data processing: 按量收

IAM / Permission

BigQuery uses Identity and Access Management (IAM) to manage access to resources. The three types of resources available in BigQuery are organizations, projects, and datasets. In the IAM policy hierarchy, datasets are child resources of projects. Tables and views are child resources of datasets — they inherit permissions from their parent dataset.

To grant access to a resource, assign one or more roles to a user, group, or service account. Organization and project roles affect the ability to run jobs or manage the project's resources, whereas dataset roles affect the ability to access or modify the data inside of a particular dataset.

IAM provides two types of roles: predefined and primitive roles. When you assign both predefined and primitive roles to a user, the permissions granted are a union of each role's permissions.

View Access Management

You cannot assign access controls directly to views. You can control view access by configuring access controls at the dataset level or at the project level.

Dataset-level access controls specify the operations users, groups, and service accounts are allowed to perform on views in that specific dataset. If you assign only dataset-level permissions, you must also assign a primitive or predefined, project-level role that provides access to the project, for example, bigquery.user.

Instead of granting access to individual datasets, you can assign predefined, project-level IAM roles that grant permissions to all views in all datasets in a project.

You can also create IAM custom roles. If you create a custom role, the permissions you grant depend on the view operations you want the user, group, or service account to be able to perform.

For more information on roles and permissions, see:

Encryption

By default, BigQuery encrypts customer content stored at rest. BigQuery handles and manages this default encryption for you without any additional actions on your part. First, data in a BigQuery table is encrypted using a data encryption key. Then, those data encryption keys are encrypted with key encryption keys, which is known as envelope encryption. Key encryption keys do not directly encrypt your data but are used to encrypt the data encryption keys that Google uses to encrypt your data.

If you want to control encryption yourself, you can use customer-managed encryption keys (CMEK) for BigQuery. Instead of Google managing the key encryption keys that protect your data, you control and manage key encryption keys in Cloud KMS.

BigQuery API

To use the Google BigQuery API, you must first authenticate to verify your client's identity. BigQuery authorizes access to resources based on the verified identity.

A service account is a Google account that is associated with your GCP project. Use a service account to access the BigQuery API if your application can run jobs associated with service credentials rather than an end-user's credentials, such as a batch processing pipeline.

Use user credentials to ensure that your application has access only to BigQuery tables that are available to the end user.

上一篇下一篇

猜你喜欢

热点阅读