# RtoSQLServer
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![R-CMD-check](https://github.com/DataScienceScotland/RtoSQLServer/workflows/R-CMD-check/badge.svg)](https://github.com/DataScienceScotland/RtoSQLServer/actions)
R package aiming to make importing of R dataframes into a MS SQL Server
database more robust.
## Installation
R package can be installed directly from Github or locally from zip.
To install directly from GitHub:
``` r
remotes::install_github("DataScienceScotland/rtosqlserver", upgrade = "never")
```
If the above does not work, install by downloading:
1. Download the [zip of the
repository](https://github.com/DataScienceScotland/RtoSQLServer/archive/refs/heads/main.zip)
from GitHub.
2. Save the downloaded zip to a specific directory (e.g.C:/temp).
3. Install with this command specifying the path to the downloaded zip:
``` r
remotes::install_local("C:/temp/RtoSQLServer-main.zip", upgrade="never")
```
## Functionality
As well as loading R dataframes into SQL Server databases, functions are
also currently available to:
- Read a database table into an R dataframe, optionally specifying a
subset of table columns or row filter.
- Drop a table from the database.
- Run any other input sql in the database and return a dataframe if a
select statement.
It is recommend to ensure using the latest versions of
[Rcpp](https://cran.r-project.org/web/packages/Rcpp/index.html),
[odbc](https://cran.r-project.org/web/packages/odbc/index.html) and
[DBI](https://cran.r-project.org/web/packages/DBI/index.html). If using
Windows in a secure environment install these from source or a Windows
binary compiled at the version of R you are using.
## Loading method used
When loading an R dataframe into SQL Server using
`write_dataframe_to_db`, following steps are followed:
1. The R dataframe is loaded into a staging table in the database in
batches of n rows at a time.
2. 1) If table of the specified name does NOT already exist in the
database schema:
1) Create target table in the database.
2) Insert all rows from staging table to target table.
3. 2) If table of same name does already exist in the database schema:
If ‘append_to_existing’=FALSE (this will result in an overwrite):
1) Drop the existing copy of the target table and create a new one
from staging table definition.
2) Insert all rows from staging table into target table.
If ‘append_to_existing’=TRUE:
1) Check that staging table columns and existing target table
columns are the same. If not, cancel loading and give a
warning.
2) If check passes, insert all rows from staging table into target
table.
4. Delete the staging table.
## Example Usage
Here is an example using the main functions:
``` r
# Make a test dataframe with n rows
test_n_rows <- 1234567
test_df <- data.frame(a = rep("a", test_n_rows), b = rep("b", test_n_rows))
# Set database connection details for use in functions:
server <- "server\\instance"
database <- "my_database_name"
schema <- "my_schema_name"
# Write the test dataframe to a SQL Server table in 100K batches (by default system versioning is FALSE)
write_dataframe_to_db(
server = server,
database = database,
schema = schema,
table_name = "test_r_tbl",
dataframe = test_df,
append_to_existing = FALSE,
batch_size = 1e5,
versioned_table = FALSE
)
# Read the SQL Server table into an R dataframe
read_df <- read_table_from_db(
server = server,
database = database,
schema = schema,
table_name = "test_r_tbl"
)
# Run other SQL for example select specific column
sql <- paste0("select a from ", schema, ".test_r_tbl")
read_selected_df <- execute_sql(
server = server,
database = database,
sql = sql,
output = TRUE
)
# SQL to return the names of all existing tables-----
# in the database into an R dataframe
sql2 <- "SELECT SCHEMA_NAME(t.schema_id) AS 'Schema',
t.name AS 'Name'
FROM sys.tables t
order by 1,2;"
db_all_tables <- execute_sql(
server = server,
database = database,
sql = sql2,
output = TRUE
)
# Drop a table from the database
drop_table_from_db(
server = server,
database = database,
schema = schema,
table_name = "test_r_tbl"
)
```
## Read database table with column or row filtering
The `read_table_from_db` function includes optional arguments so you can
read only specific columns into an R dataframe or so you can filter
which rows are read.
To select only specific columns, specify a vector of column_names for
example `c("Sepal_Length", "Species")` for the `columns` argument of
`read_table_from_db`.
To select only specific rows, specify a filter string using R syntax
(not SQL syntax) for the `filter_stmt` argument. For example
`"(Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0"`.
The function will convert this to SQL.
If you are unsure of column names in an existing database table, use the
`db_table_metadata` function to check.
Here’s a full example of both column and row selection using the
`columns` and `filter_stmt` arguments of `read_table_from_db` for a copy
of the iris data loaded into the database:
``` r
db_test_iris <- read_table_from_db(
server = server,
database = database,
schema = schema,
table_name = "test_iris",
columns = c("Sepal_Length", "Species"),
filter_stmt = "(Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0",
include_pk = FALSE
)
```
For the `filter_stmt` argument, this should be passed as a string. If
you find the syntax confusing when wrapping your filter within quotation
marks, then instead try using `deparse1(substitute())`. For example, the
above filter example could be passed as
`deparse1(substitute((Species == 'setosa' | Species == 'virginica') & Sepal_Length > 5.0))`.
If you are comfortable with SQL, you may prefer to use the `execute_sql`
function where a SQL select statement can be input to return an R
dataframe from a database.
## Background Information - Why use this package
The `RtoSQLServer` package relies on `DBI` and `odbc` packages for its
database functionality, however you should consider using it instead of
`DBI` or `odbc` for the following reasons:
#### Importing large dataframes
The main reason to use the package is to ensure the data loading process
is more robust than using `DBI` / `odbc` packages alone.
When loading large dataframes from R into MS SQL Server using `DBI` it
was found that the loading process will often hang and never complete.
People using R to load data into SQL Server may be working at home and
have varying quality of connection to the database server.
`RtoSQLServer` does two things to help with data loading:
1. Loading the dataframe into a staging table in the database first of
all. When the load into the staging table is complete, the target
table is created as a copy of the staging table. This helps to
prevent misleading problems of incomplete loads. The staging table
is dropped once the target table has been created successfully.
2. Loading in batches. The `RtoSQLServer` package’s
`write_dataframe_to_db` function includes a `batch_size` argument.
Batch size is the number of rows of the R dataframe that are loaded
into the database staging table at once. Using the batch loading
process in `RtoSQLServer` it has been possible to load large tables
it was not possible to load in one using `DBI`. The batch size is
set for the user at a default of 100K so the user does not need to
worry about a suitable batch size or splitting up the source
dataframe in R.
#### Self-contained database transactions
Unlike `DBI` and `odbc` where user must open a database connection
object in R and then pass this connection to functions (before hopefully
remembering to close it at the end of the R session), in `RtoSQLServer`
a database connection is established when calling each function and
closed at the end of that function. This means there are not hung
connections to the database, waiting until the user closes R.
The `RtoSQLServer` functions can be used within user custom functions as
self-contained tasks and do not rely on an open connection input
argument. This also means a user does not need to know how to setup a
connection, they simply pass the server and database name to the
function. This can make it easier to run code others have written.
Repeatedly opening and closing connections might make things slightly
slower, however it is thought this will not have much impact on the
workflow of the average R user where there will be only a few database
transactions in an R session.
#### Checking of table, column names and datatypes from R to SQL Server
The `DBI` and `odbc` packages are designed to work with many different
database engines. The `RtoSQLServer` package has focused on Microsoft
SQL Server. This allows checks to be written to ensure the table name is
compatible with SQL Server when being read by an ODBC driver.
The package maps R dataframe datatypes to specific SQL Server datatypes
ensuring consistent, predictable data storage. The current max length of
string in an R character column vector is used to select the size of the
SQL Server `nvarchar()` database column.
#### System Versioning
`RtoSQLServer` allows a user to create a table with MS SQL Server
[System
Versioning](https://docs.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table?view=sql-server-ver15)
enabled. To do this using `RtoSQLServer` the `write_dataframe_to_db`
function is used with the `versioned_table` argument set to TRUE.
System Versioning adds date / timestamp (SQL Server datatype
`DATETIME2`) start and end columns to the table. A `
History`
table is created in the database schema, storing the history of previous
records following updates and deletes. This allows users to maintain an
archive of their table setup in R, without needing to know the SQL used
to establish System Versioning as part of a `create table` SQL
expression.
近期下载者:
相关文件:
收藏者: