Histograms are the simplest and, sometimes, the most effective tool for quickly describing the density of a dataset. Suppose you have an independent and identically distributed random sample $X_1, \ldots, X_n$ from some unknown continuous distribution with density $f$. Recall that in a past post we explained the histogram construction. We found the following histogram estimator for $f$,

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} \sum_{j} I(X_i \in B_j)\, I(x \in B_j),$$
where the $B_j$ are a uniform partition of size $h$ of the real line. Then, continuing with the histogram series (II and III), we explored its asymptotic properties. Finally, we illustrated the theory with some simulated data. Specifically, in that post we chose the binwidth by minimizing the mean integrated squared error (MISE), obtaining

$$h_{opt} = \left( \frac{6}{n \|f'\|_2^2} \right)^{1/3}.$$
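As a quick illustration of why the earlier example worked, here is a minimal sketch that evaluates the MISE-optimal binwidth $(6 / (n \|f'\|_2^2))^{1/3}$ under the assumption that $f$ is a normal density with known standard deviation. In that case $\|f'\|_2^2 = 1/(4\sqrt{\pi}\,\sigma^3)$. The function name `normal_reference_binwidth` is mine and purely illustrative.

```python
import math


def normal_reference_binwidth(n: int, sigma: float) -> float:
    """MISE-optimal histogram binwidth, ASSUMING the data are N(mu, sigma^2).

    Uses h = (6 / (n * R))^(1/3) with R = ||f'||_2^2, and the closed form
    R = 1 / (4 * sqrt(pi) * sigma^3), which holds for a normal density only.
    """
    if n <= 0 or sigma <= 0:
        raise ValueError("n and sigma must be positive")
    roughness = 1.0 / (4.0 * math.sqrt(math.pi) * sigma ** 3)
    return (6.0 / (n * roughness)) ** (1.0 / 3.0)


# Example: binwidth for 1000 standard normal observations.
h = normal_reference_binwidth(1000, 1.0)
print(f"normal-reference binwidth: {h:.4f}")
```

The result shrinks like $n^{-1/3}$, but it is only as good as the normality assumption behind it.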
Notice that the binwidth depends on the unknown quantity $\|f'\|_2$, so the main problem remains unsolved and we cannot use it in practice. Yeah, I know! I cheated before by minimizing the MISE and ignoring the influence of this constant. In fact, everything worked well because my example was normally distributed with a relatively large sample. Of course, those conditions are rare in statistics and we need to improve the choice of $h$. In this post, we will find a fully data-driven estimator for $h$ using a technique called cross-validation. Define the integrated squared error as follows,

$$ISE(h) = \int (\hat{f}_h - f)^2 = \int \hat{f}_h^2 - 2 \int \hat{f}_h f + \int f^2.$$

The last term does not depend on $h$, so minimizing the ISE amounts to minimizing $J(h) = \int \hat{f}_h^2 - 2 \int \hat{f}_h f$.
Note that we can compute the first term using only the available sample. However, the second one depends on the unknown function $f$. The first thought that comes to mind to approximate $\int \hat{f}_h f$ is to use the empirical estimator

$$\frac{1}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i).$$
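Here, $\hat{f}_{h,-i}$ is the leave-one-out estimator, that is, the histogram built from the sample with $X_i$ removed. You guessed right! We have removed the observation $X_i$ in each evaluation to ensure the independence between $\hat{f}_{h,-i}$ and the $X_i$'s. In fact, we can prove that $E\left[ \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i) \right] = E\left[ \int \hat{f}_h f \right]$ (e.g., \cite{tsybakov2008introduction}). Then, the general criterion to find $h$ by cross-validation is,

$$\hat{h}_{CV} = \arg\min_{h > 0} CV(h), \qquad CV(h) = \int \hat{f}_h^2 - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i).$$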
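To make the criterion concrete, here is a minimal brute-force sketch for the histogram case: it computes $\int \hat{f}_h^2$ from the bin counts and then literally rebuilds the histogram $n$ times, each time without $X_i$, to get the leave-one-out term. The names `hist_density` and `cv_brute_force` are hypothetical, and I assume the bins are anchored at the origin, $B_j = [(j-1)h, jh)$.

```python
import numpy as np


def hist_density(sample, x, h):
    """Histogram estimate at x, ASSUMING bins B_j = [(j-1)h, jh) anchored at 0."""
    sample = np.asarray(sample, dtype=float)
    # Two points share a bin exactly when floor(. / h) agrees.
    same_bin = np.floor(sample / h) == np.floor(x / h)
    return same_bin.sum() / (sample.size * h)


def cv_brute_force(sample, h):
    """CV(h) = int f_hat^2 - (2/n) * sum_i f_hat_{-i}(X_i), by direct computation."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # f_hat is constant (v_j / (n h)) on each bin of width h, so its squared
    # integral is sum_j h * (v_j / (n h))^2 = sum_j v_j^2 / (n^2 h).
    _, counts = np.unique(np.floor(sample / h), return_counts=True)
    integral_sq = np.sum(counts.astype(float) ** 2) / (n ** 2 * h)
    # Leave-one-out term: drop X_i, rebuild the histogram, evaluate at X_i.
    loo = [hist_density(np.delete(sample, i), sample[i], h) for i in range(n)]
    return integral_sq - 2.0 * np.mean(loo)


# Tiny example: three points in [0, 1) and one in [1, 2), with h = 1.
print(cv_brute_force([0.1, 0.2, 0.7, 1.5], 1.0))
```

This costs on the order of $n^2$ operations per candidate $h$, which is a good reason to look for a simpler expression.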
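The last equation looks ugly, and trying to minimize it in this state seems futile. Given that we are working, for now, with the histogram case, we can simplify it even further and find something easier to compute. First, denote by $v_j$ the number of observations belonging to the interval $B_j$. The random variable $v_j$ has the form

$$v_j = \sum_{i=1}^{n} I(X_i \in B_j).$$

Since $\hat{f}_h$ equals $v_j / (nh)$ on $B_j$, we get $\int \hat{f}_h^2 = \frac{1}{n^2 h} \sum_j v_j^2$. Likewise, for $X_i \in B_j$ the leave-one-out estimator is $\hat{f}_{h,-i}(X_i) = (v_j - 1) / ((n-1)h)$, so

$$\frac{1}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i) = \frac{1}{n(n-1)h} \sum_j v_j (v_j - 1) = \frac{1}{n(n-1)h} \left( \sum_j v_j^2 - n \right).$$

Plugging both expressions into the criterion gives

$$CV(h) = \frac{2}{(n-1)h} - \frac{n+1}{n^2 (n-1) h} \sum_j v_j^2.$$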
We have achieved our main goal of finding a statistic (a formula which depends only on the data) that we can minimize numerically to find the optimal binwidth $\hat{h}_{CV}$.
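The numerical minimization can be sketched as a simple grid search. The code below assumes the closed-form leave-one-out score for histograms, $CV(h) = \frac{2}{(n-1)h} - \frac{n+1}{n^2(n-1)h} \sum_j v_j^2$, with $v_j$ the bin counts and bins anchored at the origin. The names `cv_score` and `cv_binwidth`, the simulated normal sample, and the candidate grid are all illustrative choices of mine, not a recommendation.

```python
import numpy as np


def cv_score(sample, h):
    """Closed-form leave-one-out CV score of a histogram with binwidth h.

    ASSUMES bins B_j = [(j-1)h, jh) anchored at 0; v_j are the bin counts.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Counts of the non-empty bins; empty bins contribute nothing to sum v_j^2.
    _, counts = np.unique(np.floor(sample / h), return_counts=True)
    sum_sq = np.sum(counts.astype(float) ** 2)
    return 2.0 / ((n - 1) * h) - (n + 1) * sum_sq / (n ** 2 * (n - 1) * h)


def cv_binwidth(sample, candidates):
    """Return the candidate binwidth with the smallest CV score, plus all scores."""
    candidates = np.asarray(candidates, dtype=float)
    scores = np.array([cv_score(sample, h) for h in candidates])
    return candidates[np.argmin(scores)], scores


# Simulated data, only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
grid = np.linspace(0.05, 2.0, 100)
h_cv, scores = cv_binwidth(x, grid)
print(f"cross-validated binwidth: {h_cv:.3f}")
```

A grid search is used here because the score jumps whenever an observation changes bin, so it is not a smooth function of $h$. Plotting `scores` against `grid` is a useful sanity check before trusting the minimizer.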
Optimizing the binwidth for the histogram using cross-validation (April 26, 2014)