Provenance as dependency analysis

Published online by Cambridge University Press: 27 October 2011

JAMES CHENEY
Affiliation:
Laboratory for Foundations of Computer Science, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, Scotland Email: j.cheney@inf.ed.ac.uk
AMAL AHMED
Affiliation:
School of Informatics and Computing, Indiana University, 150 S. Woodlawn Ave., Bloomington, IN 47405, U.S.A. Email: amal@cs.indiana.edu
UMUT A. ACAR
Affiliation:
Max Planck Institute for Software Systems, Gottlieb-Daimler-Strasse, Building 49, D67663 Kaiserslautern, Germany Email: umut@mpi-sws.org

Abstract

Provenance is information recording the source, derivation or history of some information. Provenance tracking has been studied in a variety of settings, particularly database management systems. However, although many candidate definitions of provenance have been proposed, the mathematical or semantic foundations of data provenance have received comparatively little attention. In this paper, we argue that dependency analysis techniques familiar from program analysis and program slicing provide a formal foundation for forms of provenance that are intended to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic characterisation of such dependency provenance for a core database query language, show that minimal dependency provenance is not computable, and provide dynamic and static approximation techniques. We also discuss preliminary implementation experience with using dependency provenance to compute data slices, or summaries of the parts of the input relevant to a given part of the output.
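The idea of dynamic dependency tracking mentioned in the abstract can be illustrated with a small sketch. It is not the paper's query language or its formal semantics: the names `Ann`, `lift`, `add` and `select_sum`, the label strings and the sample rows are all hypothetical, chosen only for illustration. Each input field carries a label, every operation propagates the union of the labels it reads, and the labels left on an output form a (not necessarily minimal) data slice of the input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    """A value paired with the set of input labels it may depend on."""
    value: object
    deps: frozenset


def lift(value, label):
    # Tag an input field with its own (hypothetical) label.
    return Ann(value, frozenset([label]))


def add(a, b):
    # The sum depends on everything either operand depends on.
    return Ann(a.value + b.value, a.deps | b.deps)


def select_sum(rows, key, wanted, field):
    """Sum `field` over the rows whose `key` field equals `wanted`."""
    total = Ann(0, frozenset())
    for row in rows:
        if row[key].value == wanted:
            total = add(total, row[field])
        # Whether a row is included is decided by its key field, so the
        # key's labels flow into the result even when the row is skipped:
        # changing that key could change the sum.
        total = Ann(total.value, total.deps | row[key].deps)
    return total


# Hypothetical input table; labels name the row and field they come from.
rows = [
    {"dept": lift("a", "r1.dept"), "pay": lift(10, "r1.pay")},
    {"dept": lift("b", "r2.dept"), "pay": lift(20, "r2.pay")},
    {"dept": lift("a", "r3.dept"), "pay": lift(5, "r3.pay")},
]

result = select_sum(rows, "dept", "a", "pay")
print(result.value)
print(sorted(result.deps))
```

In this sketch the label `r2.pay` never reaches the result, reflecting that changing that field alone cannot change the sum, whereas every `dept` label does, because editing any of them could change which rows are selected.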

Type
Paper
Copyright
Copyright © Cambridge University Press 2011
