Spark DataSource Option Parameters

2022-05-05

parquet

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

orc

https://spark.apache.org/docs/latest/sql-data-sources-orc.html

csv

https://docs.databricks.com/spark/latest/data-sources/read-csv.html#supported-options

/**
 * Loads CSV files and returns the result as a `DataFrame`.
 *
 * This function will go through the input once to determine the input schema if `inferSchema`
 * is enabled. To avoid going through the entire data once, disable `inferSchema` option or
 * specify the schema explicitly using `schema`.
 *
 * You can set the following CSV-specific options to deal with CSV files:
 * <ul>
 * <li>`sep` (default `,`): sets a single character as a separator for each
 * field and value.</li>
 * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
 * type.</li>
 * <li>`quote` (default `"`): sets a single character used for escaping quoted values where
 * the separator can be part of the value. If you would like to turn off quotations, you need to
 * set not `null` but an empty string. This behaviour is different from
 * `com.databricks.spark.csv`.</li>
 * <li>`escape` (default `\`): sets a single character used for escaping quotes inside
 * an already quoted value.</li>
 * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for
 * escaping the escape for the quote character. The default value is escape character when escape
 * and quote characters are different, `\0` otherwise.</li>
 * <li>`comment` (default empty string): sets a single character used for skipping lines
 * beginning with this character. By default, it is disabled.</li>
 * <li>`header` (default `false`): uses the first line as names of columns.</li>
 * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It
 * requires one extra pass over the data.</li>
 * <li>`ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading
 * whitespaces from values being read should be skipped.</li>
 * <li>`ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing
 * whitespaces from values being read should be skipped.</li>
 * <li>`nullValue` (default empty string): sets the string representation of a null value. Since
 * 2.0.1, this applies to all supported types including the string type.</li>
 * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
 * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity
 * value.</li>
 * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
 * value.</li>
 * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
 * Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to
 * date type.</li>
 * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
 * indicates a timestamp format. Custom date formats follow the formats at
 * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
 * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
 * a record can have.</li>
 * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed
 * for any given value being read. By default, it is -1 meaning unlimited length.</li>
 * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
 * during parsing. It supports the following case-insensitive modes.
 * <ul>
 * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
 * field configured by `columnNameOfCorruptRecord`, and sets other fields to `null`. To keep
 * corrupt records, a user can set a string type field named `columnNameOfCorruptRecord`
 * in a user-defined schema. If a schema does not have the field, it drops corrupt records
 * during parsing. A record with fewer or more tokens than the schema is not a corrupted record
 * to CSV. When it meets a record having fewer tokens than the length of the schema, it sets
 * `null` to the extra fields. When the record has more tokens than the length of the schema,
 * it drops the extra tokens.</li>
 * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
 * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
 * </ul>
 * </li>
 * <li>`columnNameOfCorruptRecord` (default is the value specified in
 * `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
 * created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
 * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li>
 * </ul>
 * @since 2.0.0
 */
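As a minimal sketch of how the options in the Scaladoc above are typically passed from PySpark: the option names come from the Scaladoc, while the values, the schema, and the helper name `read_csv` are illustrative assumptions. The function expects an existing SparkSession, so the block itself does not import pyspark.

```python
# Option names are taken from the Scaladoc above; the values are illustrative only.
CSV_OPTIONS = {
    "sep": ";",
    "header": "true",
    "nullValue": "NA",
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt_record",
}

# Hypothetical schema as a DDL string. It includes a string column named after
# columnNameOfCorruptRecord so that PERMISSIVE mode keeps malformed rows.
CSV_SCHEMA = "id INT, name STRING, _corrupt_record STRING"


def read_csv(spark, path):
    """Read `path` as CSV using the options above.

    `spark` is assumed to be an existing SparkSession.
    Supplying the schema explicitly avoids the extra pass that inferSchema needs.
    """
    return (
        spark.read
        .schema(CSV_SCHEMA)
        .options(**CSV_OPTIONS)
        .csv(path)
    )
```

With a running session this would be called as read_csv(spark, "/path/to/file.csv"); the path is a placeholder.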

text

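Options (name, default, meaning):

wholetext (default false): By default, each line in the text file becomes a new row in the resulting DataFrame. If true, the file is read as a single row instead of being split on "\n".

Usage example:

spark.read.text("/path/to/spark/README.md")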

jdbc

http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

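Options (name, default, meaning):

url: The JDBC URL to connect to. Source-specific connection properties may be specified in the URL.
dbtable: The JDBC table that should be read from or written into.
query: A query that will be used to read data into Spark.
driver (for example "com.mysql.jdbc.Driver"): The class name of the JDBC driver used to connect to this URL.
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections.
partitionColumn, lowerBound, upperBound: If any one of these options is specified, all of them must be specified. In addition, numPartitions must be specified. partitionColumn must be a numeric, date, or timestamp column of the table. Note that lowerBound and upperBound are used only to decide the partition stride, not to filter the rows in the table, so all rows in the table are partitioned and returned. These options apply only to reading.
queryTimeout (default 0): The timeout in seconds. Zero means there is no limit.
fetchsize: Determines how many rows to fetch per round trip (for example, Oracle uses 10 rows). This can help the performance of JDBC drivers. This option applies only to reading.
batchsize (default 1000): The JDBC batch size, which determines how many rows to insert per round trip. This can help the performance of JDBC drivers. This option applies only to writing.
isolationLevel (default READ_UNCOMMITTED): The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard transaction isolation levels defined by JDBC's Connection object. This option applies only to writing.
sessionInitStatement: After each database session is opened to the remote database and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block). Use it to implement session initialization code, for example: option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")
truncate (default false): When SaveMode.Overwrite is enabled, this option causes Spark to truncate the existing table instead of dropping and recreating it. This is more efficient and prevents the table metadata (for example, indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. This option applies only to writing.
cascadeTruncate (default false): If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this option allows execution of TRUNCATE TABLE t CASCADE (in the case of PostgreSQL, TRUNCATE TABLE ONLY t CASCADE is executed to prevent inadvertently truncating descendant tables). This affects other tables and should therefore be used with care. This option applies only to writing.
createTableOptions: Allows setting database-specific table and partition options when creating a table (for example, CREATE TABLE t (name string) ENGINE=InnoDB). This option applies only to writing.
createTableColumnTypes: The database column data types to use instead of the defaults when creating the table (for example: name CHAR(64), comments VARCHAR(1024)). The specified types should be valid Spark SQL data types. This option applies only to writing.
customSchema: The custom schema to use for reading data from JDBC connectors, for example id DECIMAL(38, 0), name STRING. You can also specify only some of the fields, and the others use the default type mapping, for example id DECIMAL(38, 0). The column names should be identical to the corresponding column names of the JDBC table. Users can specify the corresponding Spark SQL data types instead of using the defaults. This option applies only to reading.
pushDownPredicate (default true): Enables or disables predicate push-down into the JDBC data source. With the default value of true, Spark pushes filters down to the JDBC data source as much as possible. If set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.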
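The note that lowerBound and upperBound only decide the partition stride can be illustrated with a small pure-Python sketch. This is a simplified model written for this article, not Spark's actual code: the function name `partition_predicates` and the integer-only stride are assumptions, and Spark's internal logic may differ in detail.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE predicate per partition for an integer column.

    Simplified model: the bounds only set the stride. The first and last
    partitions are open-ended, so rows outside [lower_bound, upper_bound)
    are still read rather than filtered out.
    """
    if not 0 < num_partitions <= upper_bound - lower_bound:
        raise ValueError("need 0 < num_partitions <= upper_bound - lower_bound")

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    boundary = lower_bound
    for i in range(num_partitions):
        # No lower limit on the first partition, no upper limit on the last.
        lower = f"{column} >= {boundary}" if i > 0 else None
        boundary += stride
        upper = f"{column} < {boundary}" if i < num_partitions - 1 else None

        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            # The first partition also collects NULL values of the column.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif lower:
            predicates.append(lower)
        else:
            predicates.append("1 = 1")  # single partition: no restriction
    return predicates


print(partition_predicates("id", 0, 100, 4))
```

For a hypothetical column id with bounds 0 and 100 and four partitions, the first predicate has no lower limit and the last has no upper limit, which is why rows with id below 0 or above 100 are still returned.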

libsvm

https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html

The data must follow a specific format, for example:

1 1:-0.222222 2:0.5 3:-0.762712 4:-0.833333
1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
1 1:-0.722222 2:-0.166667 3:-0.864407 4:-0.833333
1 1:-0.722222 2:0.166667 3:-0.694915 4:-0.916667
0 1:0.166667 2:-0.416667 3:0.457627 4:0.5
1 1:-0.833333 3:-0.864407 4:-0.916667
2 1:-1.32455e-07 2:-0.166667 3:0.220339 4:0.0833333
2 1:-1.32455e-07 2:-0.333333 3:0.0169491 4:-4.03573e-08

The first column is the label; the remaining columns are the features.

label index1:value1 index2:value2 … where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
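To make the line format concrete, here is a minimal pure-Python sketch that parses one LIBSVM line and converts the one-based indices to zero-based ones. The function name `parse_libsvm_line` is hypothetical, and this is not how Spark implements its loader; it only mirrors the format described above.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, features).

    `features` maps zero-based feature indices to values, so absent
    indices are simply missing from the dict (a sparse representation).
    """
    tokens = line.split()
    label = float(tokens[0])

    features = {}
    previous_index = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # The format requires one-based indices in ascending order.
        if index <= previous_index:
            raise ValueError("indices must be one-based and ascending")
        features[index - 1] = float(value_text)
        previous_index = index
    return label, features


print(parse_libsvm_line("1 1:-0.833333 3:-0.864407 4:-0.916667"))
```

The sample line omits index 2, so the resulting dict has no entry for zero-based index 1.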

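Options (name, default, meaning):

numFeatures: The number of features. If unspecified or nonpositive, the number of features is determined automatically, at some extra performance cost.
vectorType (default sparse): The feature vector type, either sparse or dense.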

image

https://spark.apache.org/docs/latest/ml-datasource#image-data-source

json

https://docs.databricks.com/spark/latest/data-sources/read-json.html

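Options (name, default, meaning):

primitivesAsString (default false): Infers all primitive values as a string type.
prefersDecimal (default false): Infers all floating-point values as a decimal type. If the values do not fit in decimal, they are inferred as doubles.
allowComments (default false): Ignores Java/C++ style comments in JSON records.
allowUnquotedFieldNames (default false): Allows unquoted JSON field names.
allowSingleQuotes (default true): Allows single quotes in addition to double quotes.
allowNumericLeadingZeros (default false): Allows leading zeros in numbers.
allowBackslashEscapingAnyCharacter (default false): Allows quoting of any character using the backslash.
allowUnquotedControlChars (default false): Whether JSON strings may contain unquoted control characters (ASCII characters with a value less than 32, including tab and line feed characters).
mode (default PERMISSIVE): PERMISSIVE: when it meets a corrupted record, puts the malformed string into the field configured by columnNameOfCorruptRecord and sets the other fields to null. DROPMALFORMED: ignores the whole corrupted record. FAILFAST: throws an exception when it meets a corrupted record.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): Allows renaming the new field, created by PERMISSIVE mode, that holds the malformed string. This overrides spark.sql.columnNameOfCorruptRecord.
dateFormat (default yyyy-MM-dd): Sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): Sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat.
multiLine (default false): Parses one record, which may span multiple lines.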
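A minimal PySpark-style sketch of passing these JSON options follows. The option names are the ones listed above; the chosen values and the helper name `read_json` are assumptions for illustration, and the function expects an existing SparkSession rather than creating one.

```python
# Option names come from the list above; the values are illustrative only.
JSON_OPTIONS = {
    "multiLine": "true",        # each record may span several lines
    "allowComments": "true",    # tolerate Java/C++ style comments
    "mode": "DROPMALFORMED",    # skip records that fail to parse
}


def read_json(spark, path):
    """Read `path` as JSON with the options above.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.options(**JSON_OPTIONS).json(path)
```

Switching "mode" to "FAILFAST" in the same dictionary would make the read raise on the first corrupted record instead of skipping it.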

xml

https://github.com/databricks/spark-xml

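Read options (name, default, meaning):

path: The path of the file to read.
rowTag (default ROW): The row tag of the XML files to treat as a row. For example, in the XML <books> <book> <book> ... </books>, the rowTag is book.
samplingRatio (default 1.0): The sampling ratio for inferring the schema (0.0 ~ 1). Possible types are StructType, ArrayType, StringType, LongType, DoubleType, BooleanType, TimestampType and NullType.
excludeAttribute (default false): Whether to exclude attributes in elements.
nullValue (default "null"): The value to read as a null value. The default is the string null.
mode (default PERMISSIVE): PERMISSIVE: when it meets a corrupted record, sets the other fields to null and puts the malformed string into the field configured by columnNameOfCorruptRecord. DROPMALFORMED: ignores the whole corrupted record. FAILFAST: throws an exception when it meets a corrupted record.
inferSchema (default true): If true, attempts to infer an appropriate type for each resulting DataFrame column, such as a boolean, numeric, or date type. If false, all resulting columns are of string type.
columnNameOfCorruptRecord (default _corrupt_record): The name of the new field in which malformed strings are stored.
attributePrefix (default _): The prefix for attributes, so that attributes and elements can be told apart. This becomes the prefix of the field names.
valueTag (default _VALUE): The tag used for the value when an element has attributes but no child elements.
charset (default UTF-8): The character encoding.
ignoreSurroundingSpaces (default false): Defines whether whitespace surrounding the values being read should be skipped.

Write options (name, default, meaning):

path: The path of the file to write.
rowTag (default ROW): The row tag of the XML files to treat as a row. For example, in the XML <books> <book> <book> ... </books>, the rowTag is book.
rootTag (default ROWS): The root tag of the XML files. For example, in the XML <books> <book> <book> ... </books>, the rootTag is books.
nullValue (default "null"): The value to write for a null value. The default is the string null. When it is null, no attributes or elements are written for the field.
attributePrefix (default _): The prefix for attributes, so that attributes and elements can be told apart. This becomes the prefix of the field names.
valueTag (default _VALUE): The tag used for the value when an element has attributes but no child elements.
compression: The compression codec to use when saving to file. It should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec, or one of the case-insensitive short names (bzip2, gzip, lz4, and snappy). When no codec is specified, the default is no compression.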
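The rowTag and rootTag options can be sketched as follows, assuming the spark-xml package is on the classpath and using the books example from the option descriptions. The helper names `read_books` and `write_books` are hypothetical, and both functions expect objects created elsewhere (a SparkSession and a DataFrame).

```python
# Data source name of the spark-xml package.
XML_FORMAT = "com.databricks.spark.xml"


def read_books(spark, path):
    """Read each <book> element under the root as one row.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read
        .format(XML_FORMAT)
        .options(rowTag="book", attributePrefix="_", valueTag="_VALUE")
        .load(path)
    )


def write_books(df, path):
    """Write `df` as <books><book>...</book></books>."""
    (
        df.write
        .format(XML_FORMAT)
        .options(rowTag="book", rootTag="books")
        .save(path)
    )
```

attributePrefix and valueTag are set to their listed defaults here only to show where they go; they can be omitted.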

excel

https://github.com/crealytics/spark-excel
