[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

I think that something unique along the lines of `REGISTER EXTERNAL DATA SOURCE` is probably fine, as it doesn't conflict with existing behaviors of other dialects.

> There is a lot of value in making sure our common operations closely map to the equivalent common operations in other SQL dialects. 

We're trying to make opposite points using the same arguments :) A lot of popular dialects make difference between CREATE TABLE and CREATE EXTERNAL TABLE (or similar):
 - T-SQL:
      create: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
      create external: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
      external datasource: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
 - PL/SQL:
      create: https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
      create external: https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
 - postgres:
      import foreign schema: https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
      create table: https://www.postgresql.org/docs/9.1/static/sql-createtable.html
 - redshift:
      create external schema: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
      create table: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
 - hive internal and external: https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables

My understanding is that the behavior of create table is somewhat similar in all of the above dialects, from the high-level perspective it usually creates a persistent table in the current storage context (database). That's not what Beam SQL's create table does right now, and my opinion is that it should not be called create table for this reason.

>  I think users will be more confused to find that 'CREATE TABLE' doesn't exist then to learn that it might not always create a table. 

I think that having CREATE TABLE do something unexpected or not do something expected (or do the opposite things depending on the table type or some flag) is worse than having users look up the correct way of creating a data source in Beam SQL without expecting something we don't promise.

>  (For example, a user guessing at the syntax of CREATE TABLE would have a better experience with the error being "field LOCATION not specified" rather than "operation CREATE TABLE not found".)

They have to look it up anyway (what format is location for a Pubsub topic? or is it a subscription?), and when doing so I think it would be less confusing to read that to get data from Pubsub/Kafka/... in Beam SQL you have to do something like `REGISTER EXTERNAL DATA SOURCE` than `CREATE TABLE`.

External tables and schemas don't have a standard approach and I don't have a strong preference between any one from the above. 

On Wed, Aug 15, 2018 at 1:08 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
Adding dev@ back now.


On Wed, Aug 15, 2018 at 1:01 PM Andrew Pilloud <apilloud@xxxxxxxxxx> wrote:
Did we drop the dev list from this on purpose? (I haven't added it back, but we probably should.)

I'm in favor of sticking with the simple 'CREATE TABLE' and 'CREATE SCHEMA' if there is only to be one option. Sticking with those names minimizes both our deviation from other implementations and user surprise. There is a lot of value in making sure our common operations closely map to the equivalent common operations in other SQL dialects. I think users will be more confused to find that 'CREATE TABLE' doesn't exist then to learn that it might not always create a table. This minimizes the overhead of learning our dialect of SQL and maximizes the odds that a user will be able to guess at the syntax of something and have it work. (For example, a user guessing at the syntax of CREATE TABLE would have a better experience with the error being "field LOCATION not specified" rather than "operation CREATE TABLE not found".)

If the goal is clarity of the operation, how about 'REGISTER EXTERNAL DATA SOURCE' and 'REGISTER EXTERNAL DATA SOURCE PROVIDER'? Those names remove the ambiguity around the operation creating and the data source being a table.


On Wed, Aug 15, 2018 at 10:54 AM Anton Kedin <kedin@xxxxxxxxxx> wrote:
My preference is to make `EXTERNAL` mandatory and only support `CREATE EXTERNAL TABLE` for existing semantics. My main reasons are:
 - user friendliness, matching expectations, readability. Current `CREATE TABLE` is basically a `CREATE EXTERNAL TABLE`. It is confusing to users familiar with SQL who expect that `CREATE TABLE` will actually create a table;
 - forward-compatibility. We could potentially support non-external `CREATE TABLE` at some point in the future, whatever semantics it might have. It will be wrong to use the same syntax for external and non-external CREATEs;

I agree that typing extra word each time is not ideal, but my opinion is on the side that readability of code (including SQL) is important (how much time you spend reading / understanding code vs writing it) and we should try to improve it if we can. In case of DDL every non-trivial statement will already have a ton of unavoidable words (field names, types, location, options) so I would argue that adding extra one word would not noticeably reduce your happiness of writing it :) But it would improve readability and reduce ambiguity, which I think is worth it.

I think that making it optional only introduces more confusion (e.g. what's the difference between the two DDL statements without reading the doc?) and would make situation worse.


On Wed, Aug 15, 2018 at 10:24 AM Mingmin Xu <mingmxus@xxxxxxxxx> wrote:
I prefer to `CREATE EXTERNAL TABLE`. My question is, do you plan to support both `CREATE TABLE` and `CREATE EXTERNAL TABLE`, by making `EXTERNAL` as optional? 

On Wed, Aug 15, 2018 at 10:01 AM, Andrew Pilloud <apilloud@xxxxxxxxxx> wrote:
I think 'CREATE EXTERNAL TABLE' might make things a bit clearer from a documentation prospective, but I'd be really unhappy if I had to type out 'EXTERNAL' every time. (I have the same concern with 'CREATE EXTERNAL SCHEMA'.)


On Tue, Aug 14, 2018 at 12:38 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
Hi guys,

I know you are probably using CREATE TABLE, Can I know your thoughts on this?


On Tue, Aug 14, 2018 at 10:22 AM Rui Wang <ruwang@xxxxxxxxxx> wrote:
Thanks Mikhail! "Import" is an alternative option. It might be better.

"create external" is being widely used by different systems with similar meaning so "create" usually is ok to external data sources.


On Tue, Aug 14, 2018 at 9:38 AM Mikhail Gryzykhin <migryz@xxxxxxxxxx> wrote:
The idea of clarification sounds good to me. I'd appreciate that present, when I was triaging post-commit tests.

Do we have any terms that specify connection to external table? "CREATE" word triggers this reaction in my brain that there will be a new table created. Adding "EXTERNAL" would already add distinction, but adding something more explicit for the task might be even better.


Have feedback

On Mon, Aug 13, 2018 at 2:40 PM Rafael Fernandez <rfernand@xxxxxxxxxx> wrote:
Strictly speaking, they are not necessarily tables either. We could also introduce something like CREATE EXTERNAL DATA SOURCE (a-la T-SQL), if it's somehow advantageous for us to leverage access patterns or restrict DML statements.

I think your idea of CREATE EXTERNAL TABLE is practical :)

On Mon, Aug 13, 2018 at 2:12 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
Hi Community,

BeamSQL allows CREATE TABLE statements to register virtual tables from external storage systems (e.g. BigQuery). 

BeamSQL is not a storage system, so any table registered by "CREATE TABLE" statement is essentially equivalent to be registered by "CREATE EXTERNAL TABLE", which requires the user to provide a LOCATION and BeamSQL will register the table outside of current execution environment based on LOCATION.

So I propose to add EXTERNAL keyword to "CREATE TABLE" in BeamSQL to help users understand they are registering tables, and BeamSQL does not create non existing tables by running CREATE TABLE (at least on some storage systems, if not all). 

We can make the EXTERNAL keyword either required or optional.

If we make the EXTERNAL keyword required:

a. We can get rid of the registering table semantic on CREATE TABLE. 
b, We keep the room that we could add CREATE TABLE back in the future if we want CREATE TABLE to create, rather than not only register tables in BeamSQL. 

1. CREATE TABLE syntax will not be supported so existing BeamSQL pipelines which has CREATE TABLE require changes.
2. It's required to type tedious EXTERNAL keyword every time, especially in SQL Shell.

If we make the EXTERNAL keyword optional, we will have reversed pros and cons above.

Any thoughts on adding EXTERNAL keyword, and make it required or optional?