RtoSQLServer

所属分类:uCOS/RTOS
开发工具:R
文件大小:0KB
下载次数:0
上传日期:2024-01-12 08:34:59
上 传 者sh-1993
说明:  RtoSQL服务器
(RtoSQLServer)

文件列表:
R/
demo/
man/
tests/
.Rbuildignore
.lintr
DESCRIPTION
LICENSE
NAMESPACE
RtoSQLServer.Rproj

# RtoSQLServer [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) [![R-CMD-check](https://github.com/DataScienceScotland/RtoSQLServer/workflows/R-CMD-check/badge.svg)](https://github.com/DataScienceScotland/RtoSQLServer/actions) R package aiming to make importing of R dataframes into a MS SQL Server database more robust. ## Installation R package can be installed directly from Github or locally from zip. To install directly from GitHub: ``` r remotes::install_github("DataScienceScotland/rtosqlserver", upgrade = "never") ``` If the above does not work, install by downloading: 1. Download the [zip of the repository](https://github.com/DataScienceScotland/RtoSQLServer/archive/refs/heads/main.zip) from GitHub. 2. Save the downloaded zip to a specific directory (e.g.C:/temp). 3. Install with this command specifying the path to the downloaded zip: ``` r remotes::install_local("C:/temp/RtoSQLServer-main.zip", upgrade="never") ``` ## Functionality As well as loading R dataframes into SQL Server databases, functions are also currently available to: - Read a database table into an R dataframe, optionally specifying a subset of table columns or row filter. - Drop a table from the database. - Run any other input sql in the database and return a dataframe if a select statement. It is recommend to ensure using the latest versions of [Rcpp](https://cran.r-project.org/web/packages/Rcpp/index.html), [odbc](https://cran.r-project.org/web/packages/odbc/index.html) and [DBI](https://cran.r-project.org/web/packages/DBI/index.html). If using Windows in a secure environment install these from source or a Windows binary compiled at the version of R you are using. ## Loading method used When loading an R dataframe into SQL Server using `write_dataframe_to_db`, following steps are followed: 1. The R dataframe is loaded into a staging table in the database in batches of n rows at a time. 2. 1) If table of the specified name does NOT already exist in the database schema: 1) Create target table in the database. 2) Insert all rows from staging table to target table. 3. 2) If table of same name does already exist in the database schema: If ‘append_to_existing’=FALSE (this will result in an overwrite): 1) Drop the existing copy of the target table and create a new one from staging table definition. 2) Insert all rows from staging table into target table. If ‘append_to_existing’=TRUE: 1) Check that staging table columns and existing target table columns are the same. If not, cancel loading and give a warning. 2) If check passes, insert all rows from staging table into target table. 4. Delete the staging table. ## Example Usage Here is an example using the main functions: ``` r # Make a test dataframe with n rows test_n_rows <- 1234567 test_df <- data.frame(a = rep("a", test_n_rows), b = rep("b", test_n_rows)) # Set database connection details for use in functions: server <- "server\\instance" database <- "my_database_name" schema <- "my_schema_name" # Write the test dataframe to a SQL Server table in 100K batches (by default system versioning is FALSE) write_dataframe_to_db( server = server, database = database, schema = schema, table_name = "test_r_tbl", dataframe = test_df, append_to_existing = FALSE, batch_size = 1e5, versioned_table = FALSE ) # Read the SQL Server table into an R dataframe read_df <- read_table_from_db( server = server, database = database, schema = schema, table_name = "test_r_tbl" ) # Run other SQL for example select specific column sql <- paste0("select a from ", schema, ".test_r_tbl") read_selected_df <- execute_sql( server = server, database = database, sql = sql, output = TRUE ) # SQL to return the names of all existing tables----- # in the database into an R dataframe sql2 <- "SELECT SCHEMA_NAME(t.schema_id) AS 'Schema', t.name AS 'Name' FROM sys.tables t order by 1,2;" db_all_tables <- execute_sql( server = server, database = database, sql = sql2, output = TRUE ) # Drop a table from the database drop_table_from_db( server = server, database = database, schema = schema, table_name = "test_r_tbl" ) ``` ## Read database table with column or row filtering The `read_table_from_db` function includes optional arguments so you can read only specific columns into an R dataframe or so you can filter which rows are read. To select only specific columns, specify a vector of column_names for example `c("Sepal_Length", "Species")` for the `columns` argument of `read_table_from_db`. To select only specific rows, specify a filter string using R syntax (not SQL syntax) for the `filter_stmt` argument. For example `"(Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0"`. The function will convert this to SQL. If you are unsure of column names in an existing database table, use the `db_table_metadata` function to check. Here’s a full example of both column and row selection using the `columns` and `filter_stmt` arguments of `read_table_from_db` for a copy of the iris data loaded into the database: ``` r db_test_iris <- read_table_from_db( server = server, database = database, schema = schema, table_name = "test_iris", columns = c("Sepal_Length", "Species"), filter_stmt = "(Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0", include_pk = FALSE ) ``` For the `filter_stmt` argument, this should be passed as a string. If you find the syntax confusing when wrapping your filter within quotation marks, then instead try using `deparse1(substitute())`. For example, the above filter example could be passed as `deparse1(substitute((Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0))`. If you are comfortable with SQL, you may prefer to use the `execute_sql` function where a SQL select statement can be input to return an R dataframe from a database. ## Background Information - Why use this package The `RtoSQLServer` package relies on `DBI` and `odbc` packages for its database functionality, however you should consider using it instead of `DBI` or `odbc` for the following reasons: #### Importing large dataframes The main reason to use the package is to ensure the data loading process is more robust than using `DBI` / `odbc` packages alone. When loading large dataframes from R into MS SQL Server using `DBI` it was found that the loading process will often hang and never complete. People using R to load data into SQL Server may be working at home and have varying quality of connection to the database server. `RtoSQLServer` does two things to help with data loading: 1. Loading the dataframe into a staging table in the database first of all. When the load into the staging table is complete, the target table is created as a copy of the staging table. This helps to prevent misleading problems of incomplete loads. The staging table is dropped once the target table has been created successfully. 2. Loading in batches. The `RtoSQLServer` package’s `write_dataframe_to_db` function includes a `batch_size` argument. Batch size is the number of rows of the R dataframe that are loaded into the database staging table at once. Using the batch loading process in `RtoSQLServer` it has been possible to load large tables it was not possible to load in one using `DBI`. The batch size is set for the user at a default of 100K so the user does not need to worry about a suitable batch size or splitting up the source dataframe in R. #### Self-contained database transactions Unlike `DBI` and `odbc` where user must open a database connection object in R and then pass this connection to functions (before hopefully remembering to close it at the end of the R session), in `RtoSQLServer` a database connection is established when calling each function and closed at the end of that function. This means there are not hung connections to the database, waiting until the user closes R. The `RtoSQLServer` functions can be used within user custom functions as self-contained tasks and do not rely on an open connection input argument. This also means a user does not need to know how to setup a connection, they simply pass the server and database name to the function. This can make it easier to run code others have written. Repeatedly opening and closing connections might make things slightly slower, however it is thought this will not have much impact on the workflow of the average R user where there will be only a few database transactions in an R session. #### Checking of table, column names and datatypes from R to SQL Server The `DBI` and `odbc` packages are designed to work with many different database engines. The `RtoSQLServer` package has focused on Microsoft SQL Server. This allows checks to be written to ensure the table name is compatible with SQL Server when being read by an ODBC driver. The package maps R dataframe datatypes to specific SQL Server datatypes ensuring consistent, predictable data storage. The current max length of string in an R character column vector is used to select the size of the SQL Server `nvarchar()` database column. #### System Versioning `RtoSQLServer` allows a user to create a table with MS SQL Server [System Versioning](https://docs.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table?view=sql-server-ver15) enabled. To do this using `RtoSQLServer` the `write_dataframe_to_db` function is used with the `versioned_table` argument set to TRUE. System Versioning adds date / timestamp (SQL Server datatype `DATETIME2`) start and end columns to the table. A `History` table is created in the database schema, storing the history of previous records following updates and deletes. This allows users to maintain an archive of their table setup in R, without needing to know the SQL used to establish System Versioning as part of a `create table` SQL expression.
近期下载者

相关文件


收藏者