benford
Category: Data mining / data warehousing
Development tool: R
File size: 3021 KB
Downloads: 0
Upload date: 2019-09-07 13:22:24
Uploader: sh-1993
Description: Tools that make it easier to use Benford’s law for data validation and forensic analytics.
File list:
.Rbuildignore (126, 2019-08-24)
.Rprofile (47, 2019-08-24)
.travis.yml (144, 2019-08-24)
DESCRIPTION (515, 2019-08-24)
NAMESPACE (894, 2019-08-24)
R (0, 2019-08-24)
R\data.documentation.R (5701, 2019-08-24)
R\functions-new.R (19305, 2019-08-24)
R\get.functions.R (10649, 2019-08-24)
R\internal.functions-new-code-2.R (21104, 2019-08-24)
appveyor.yml (1091, 2019-08-24)
benford.analysis.Rproj (386, 2019-08-24)
cran-comments.md (370, 2019-08-24)
data (0, 2019-08-24)
data\census.2000_2010.rda (39708, 2019-08-24)
data\census.2009.rda (120638, 2019-08-24)
data\corporate.payment.rda (911512, 2019-08-24)
data\datalist (184, 2019-08-24)
data\fibonacci.500.rda (3990, 2019-08-24)
data\gm.payments.rda (48324, 2019-08-24)
data\journal.entry.rda (114975, 2019-08-24)
data\lakes.perimeter.rda (5956, 2019-08-24)
data\madoff.returns.rda (771, 2019-08-24)
data\purchasing.cards.2010.rda (1197890, 2019-08-24)
data\sino.forest.rda (3045, 2019-08-24)
data\streamflow.rda (52991, 2019-08-24)
data\taxable.incomes.1978.rda (340016, 2019-08-24)
man (0, 2019-08-24)
man\MAD.Rd (559, 2019-08-24)
man\MAD.conformity.Rd (824, 2019-08-24)
man\benford.Rd (4500, 2019-08-24)
man\benford.analysis.Rd (2468, 2019-08-24)
man\census.2000_2010.Rd (518, 2019-08-24)
man\census.2009.Rd (549, 2019-08-24)
man\chisq.Rd (632, 2019-08-24)
man\corporate.payment.Rd (553, 2019-08-24)
... ...
## benford.analysis
[![Travis-CI Build
Status](https://travis-ci.org/carloscinelli/benford.analysis.svg?branch=master)](https://travis-ci.org/carloscinelli/benford.analysis)
[![Build
status](https://ci.appveyor.com/api/projects/status/igyn1737s67jqqnb/branch/master?svg=true)](https://ci.appveyor.com/project/carloscinelli/benford-analysis/branch/master)
[![Coverage
Status](https://img.shields.io/codecov/c/github/carloscinelli/benford.analysis/master.svg)](https://codecov.io/github/carloscinelli/benford.analysis?branch=master)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/benford.analysis)](https://cran.r-project.org/package=benford.analysis)
![](http://cranlogs.r-pkg.org/badges/benford.analysis)
The Benford Analysis (`benford.analysis`) package provides tools that
make it easier to validate data using Benford’s Law. The main purpose of
the package is to identify suspicious data that need further
verification.
## CRAN
You can install the package from CRAN by running:
``` r
install.packages("benford.analysis")
```
## How to install the development version from GitHub
To install the GitHub version you need to have the package `devtools`
installed. Make sure to set the option `build_vignettes = TRUE` to
compile the package vignette.
``` r
# install.packages("devtools") # run this to install the devtools package
devtools::install_github("carloscinelli/benford.analysis", build_vignettes = TRUE)
```
## Example usage
The `benford.analysis` package comes with 6 real datasets from Mark
Nigrini’s book [Benford’s Law: Applications for Forensic Accounting,
Auditing, and Fraud
Detection](http://www.amazon.com/gp/product/B007KG9ZAI/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B007KG9ZAI&linkCode=as2&tag=analreal-20).
Here we will give an example using 189,470 records from the corporate
payments data. First we need to load the package and the data:
``` r
library(benford.analysis) # loads package
data(corporate.payment) # loads data
```
Then, to validate the data against Benford’s law, you simply apply the
function `benford` to the appropriate column:
``` r
bfd.cp <- benford(corporate.payment$Amount)
```
The command above created an object of class “Benford” with the results
of the analysis using the first two significant digits. You can choose
a different number of digits by changing the `number.of.digits` parameter.
For more information and parameters, see `?benford`.

Let’s check the main plots of the analysis:
``` r
plot(bfd.cp)
```
![](tools/unnamed-chunk-6-1.png)
The original data are shown in blue and the expected frequencies according
to Benford’s law in red. For instance, in our example, the first plot
shows that the data do have a tendency to follow Benford’s law, but also
that there is a clear discrepancy at the leading digits 50.
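
As a rough illustration of what the expected frequencies represent, the sketch below computes the Benford proportions for the first two significant digits in base R, using the standard formula P(d) = log10(1 + 1/d) for d = 10, ..., 99. The variable names are illustrative and not part of the package; only the observation count 185083 comes from the printed results of this example.

``` r
# Benford's law for the first two significant digits (d = 10, ..., 99)
two_digits    <- 10:99
expected_prop <- log10(1 + 1 / two_digits)
names(expected_prop) <- two_digits

# The 90 proportions telescope to log10(100 / 10) = 1
total <- sum(expected_prop)

# Expected share of values whose leading digits are 50 (a bit under 1%)
p50 <- expected_prop[["50"]]

# Expected count for digits 50 among the 185083 observations used in the example
expected_count_50 <- p50 * 185083
```

Under these assumptions, roughly 1,592 of the observations would be expected to start with 50, which gives a sense of scale for the discrepancy visible in the plot.
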
You can print the main results of the analysis:
``` r
bfd.cp
#>
#> Benford object:
#>
#> Data: corporate.payment$Amount
#> Number of observations used = 185083
#> Number of obs. for second order = 65504
#> First digits analysed = 2
#>
#> Mantissa:
#>
#> Statistic Value
#> Mean 0.496
#> Var 0.092
#> Ex.Kurtosis -1.257
#> Skewness -0.002
#>
#>
#> The 5 largest deviations:
#>
#> digits absolute.diff
#> 1 50 5938.25
#> 2 11 3331.***
#> 3 10 2811.92
#> 4 14 1043.68
#> 5 *** 889.95
#>
#> Stats:
#>
#> Pearson's Chi-squared test
#>
#> data: corporate.payment$Amount
#> X-squared = 32094, df = 89, p-value < 2.2e-16
#>
#>
#> Mantissa Arc Test
#>
#> data: corporate.payment$Amount
#> L2 = 0.0039958, df = 2, p-value < 2.2e-16
#>
#>
#> Kolmogorov-Smirnov test
#>
#> data: corporate.payment$Amount
#> D = 0.033195, critical value = 0.0031612
#>
#> Mean Absolute Deviation (MAD): 0.002336614
#> MAD Conformity - Nigrini (2012): Nonconformity
#> Distortion Factor: -1.065467
#>
#> Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
```
The print method first shows the general information of the analysis,
like the name of the data used, the number of observations used and how
many significant digits were analyzed.
After that you have the main statistics of the log mantissa of the data.
If the data follows Benford’s Law, the numbers should be close to:
| Statistic | Value |
| ------------ | --------------- |
| Mean | 0.5 |
| Variance | 1/12 (0.08333...) |
| Ex. Kurtosis | \-1.2 |
| Skewness | 0 |
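
These reference values are the moments of a uniform distribution on [0, 1), which is what the mantissa of log10(x) follows when the data conform to Benford’s Law. A minimal base R sketch with simulated data (not the package’s own code; all names are illustrative) shows how such statistics could be computed:

``` r
set.seed(123)

# Simulated data that follow Benford's law: 10 raised to a uniform exponent
x <- 10^runif(1e5, min = 0, max = 6)

# Mantissa: fractional part of the base-10 logarithm
mantissa <- log10(x) %% 1

m <- mean(mantissa)
v <- var(mantissa)

# Standardize with the population standard deviation for the moment ratios
z       <- (mantissa - m) / sqrt(mean((mantissa - m)^2))
skew    <- mean(z^3)
ex_kurt <- mean(z^4) - 3

round(c(Mean = m, Var = v, Ex.Kurtosis = ex_kurt, Skewness = skew), 3)
```

With simulated data of this size the four numbers should land close to the table values; real data, such as the corporate payments above, will deviate to some degree.
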
Printing also shows the 5 largest discrepancies. Notice that, as we had
seen in the plot, the largest deviation is at the digits 50. These digit
groups are good candidates for closer inspection. The output also shows
the results of statistical tests such as the Chi-squared test and the
Mantissa Arc test.
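
To make these quantities concrete, here is a hedged base R sketch on simulated data. It assumes that `absolute.diff` is the absolute difference between observed and expected counts (consistent with the magnitudes printed above), that the chi-squared statistic is the usual Pearson sum over the 90 two-digit groups, and that MAD is the mean absolute difference between observed and expected proportions. The helper `first_two_digits` is hypothetical, and the package's internal computations may differ.

``` r
# Hypothetical helper: first two significant digits of positive numbers
first_two_digits <- function(x) {
  floor(x / 10^(floor(log10(x)) - 1))
}

set.seed(42)
# Simulated Benford-like amounts plus an artificial excess of 50.xx payments
amounts <- c(10^runif(20000, min = 1, max = 5),
             runif(500, min = 50, max = 51))

n             <- length(amounts)
expected_prop <- log10(1 + 1 / (10:99))
observed      <- tabulate(first_two_digits(amounts) - 9, nbins = 90)  # counts for 10..99
expected      <- n * expected_prop

abs_diff <- abs(observed - expected)               # per-group absolute difference in counts
chi_sq   <- sum((observed - expected)^2 / expected) # Pearson statistic, df = 90 - 1 = 89
p_value  <- pchisq(chi_sq, df = 89, lower.tail = FALSE)
mad      <- mean(abs(observed / n - expected_prop)) # mean absolute deviation of proportions

# The five digit groups with the largest absolute differences
top <- order(abs_diff, decreasing = TRUE)[1:5]
data.frame(digits = (10:99)[top], absolute.diff = round(abs_diff[top], 2))
```

Because the simulation deliberately adds extra payments between 50 and 51, the digits 50 come out as the largest deviation, mirroring the pattern in the corporate payments example.
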
The package provides some helper functions to further investigate the
data. For example, you can easily extract the observations with the
largest discrepancies by using the `getSuspects` function.
``` r
suspects <- getSuspects(bfd.cp, corporate.payment)
suspects
#> Warning in format.POSIXlt(as.POSIXlt(x), ...): unknown timezone 'zone/tz/
#> 2019b.1.0/zoneinfo/America/Los_Angeles'
#> VendorNum Date InvNum Amount
#> 1: 2001 2010-01-02 3822J10 50.38
#> 2: 2001 2010-01-07 100107-2 1166.29
#> 3: 2001 2010-01-08 11210084007 1171.45
#> 4: 2001 2010-01-08 1585J10 50.42
#> 5: 2001 2010-01-08 4733J10 113.34
#> ---
#> 17852: 52867 2010-07-01 270358343233 11.58
#> 17853: 52870 2010-02-01 270682253025 11.20
#> 17854: 52904 2010-06-01 271866383919 50.15
#> 17855: 52911 2010-02-01 270957401515 11.20
#> 17856: 52934 2010-02-01 271745237617 11.88
```
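
Conceptually, extracting suspects amounts to keeping the rows whose leading digits fall in the most deviant groups; in the output above, the amounts begin with 50 or 11, the two largest deviations. The base R sketch below illustrates that idea on a toy data frame. It is not the package's implementation: the rows are made up for illustration (three amounts are borrowed from the output above) and `suspicious_digits` is an assumed input.

``` r
# Toy data: a few payments (illustrative rows only)
payments <- data.frame(
  InvNum = c("A1", "A2", "A3", "A4", "A5"),
  Amount = c(50.38, 1166.29, 723.10, 11.58, 98.00)
)

# Assumed: the digit groups flagged as most deviant in the printed summary
suspicious_digits <- c(50, 11)

# First two significant digits of each amount
lead2 <- floor(payments$Amount / 10^(floor(log10(payments$Amount)) - 1))

# Keep only the rows whose leading digits are in the flagged groups
suspects_sketch <- payments[lead2 %in% suspicious_digits, ]
suspects_sketch
```
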
More information can be found in the package’s help documentation and
examples. The vignette will be ready soon.