From 9a6d6e79f700e33051063dec3a80be9a79095e19 Mon Sep 17 00:00:00 2001
From: yury
Date: Sun, 20 Aug 2006 13:13:22 +0000
Subject: [PATCH] - merge with 1_5, some fixes.

---
 .../patches/ext3-check-jbd-errors-2.6.5.patch      |  101 +
 .../patches/ext3-check-jbd-errors-2.6.9.patch      |  101 +
 .../patches/ext3-ea-in-inode-2.6-rhel4.patch       |   10 +-
 .../patches/ext3-ea-in-inode-2.6-suse.patch        |   10 +-
 .../patches/ext3-extents-2.6.12.patch              |   34 +-
 .../patches/ext3-extents-2.6.15.patch              | 2933 ++++++++++++
 .../patches/ext3-extents-2.6.18-vanilla.patch      | 2935 ++++++++++++
 .../patches/ext3-extents-2.6.5.patch               |   34 +-
 .../patches/ext3-extents-2.6.9-rhel4.patch         |   34 +-
 .../patches/ext3-filterdata-2.6.15.patch           |   25 +
 .../patches/ext3-lookup-dotdot-2.6.9.patch         |   63 +
 .../patches/ext3-mballoc2-2.6-fc5.patch            | 2779 +++++++++++
 .../patches/ext3-mballoc2-2.6-suse.patch           |  552 ++-
 .../patches/ext3-mballoc2-2.6.12.patch             |  540 ++-
 .../patches/ext3-mballoc2-2.6.18-vanilla.patch     | 2810 ++++++++++++
 .../patches/ext3-mballoc2-2.6.9-rhel4.patch        |  556 ++-
 .../patches/ext3-sector_t-overflow-2.6.12.patch    |   64 +
 .../ext3-sector_t-overflow-2.6.5-suse.patch        |   44 +
 .../ext3-sector_t-overflow-2.6.9-rhel4.patch       |   64 +
 .../patches/ext3-wantedi-2.6-rhel4.patch           |    7 +-
 .../patches/ext3-wantedi-2.6-suse.patch            |    7 +-
 ldiskfs/kernel_patches/patches/iopen-2.6-fc5.patch |  448 ++
 .../kernel_patches/series/ldiskfs-2.6-fc3.series   |   22 +
 .../kernel_patches/series/ldiskfs-2.6-fc5.series   |   12 +
 .../kernel_patches/series/ldiskfs-2.6-rhel4.series |    3 +
 .../kernel_patches/series/ldiskfs-2.6-suse.series  |    3 +
 .../series/ldiskfs-2.6.12-vanilla.series           |    2 +
 .../series/ldiskfs-2.6.18-vanilla.series           |   13 +
 lustre/ChangeLog                                   |  363 +-
 lustre/autoMakefile.am                             |   33 +-
 lustre/autoconf/lustre-core.m4                     |  129 +-
 lustre/contrib/.cvsignore                          |    2 +
 lustre/contrib/Makefile.am                         |    5 +
 lustre/contrib/README                              |    2 +
 lustre/contrib/mpich-1.2.6-lustre.patch            | 1829 ++++++++
 lustre/doc/Makefile.am                             |    6 +-
 lustre/doc/lconf.8                                 |   91 +-
 lustre/doc/lctl.8                                  |  303 +-
 lustre/doc/lfs.1                                   |   66 +-
 lustre/doc/lfs.lyx                                 |   65 +-
 lustre/doc/llverdev.txt                            |   48 +
 lustre/doc/llverfs.txt                             |   48 +
 lustre/doc/lmc.1                                   |    2 +-
 lustre/doc/lmc.lyx                                 |    3 +-
 lustre/doc/lustre.7                                |   76 +
 lustre/doc/mkfs.lustre.8                           |  129 +
 lustre/doc/mount.lustre.8                          |  105 +
 lustre/doc/tunefs.lustre.8                         |   92 +
 lustre/fid/fid_handler.c                           |   15 +-
 lustre/fid/fid_internal.h                          |    2 +-
 lustre/fid/fid_request.c                           |   10 +-
 lustre/fld/fld_cache.c                             |    6 +-
 lustre/fld/fld_handler.c                           |    5 +-
 lustre/fld/fld_internal.h                          |    2 +-
 lustre/fld/fld_request.c                           |   12 +-
 lustre/include/Makefile.am                         |    4 +-
 lustre/include/liblustre.h                         |   10 +-
 lustre/include/linux/Makefile.am                   |    2 +-
 lustre/include/linux/lustre_acl.h                  |   16 +-
 lustre/include/linux/lustre_compat25.h             |   43 +-
 lustre/include/linux/lustre_dlm.h                  |    4 +
 lustre/include/linux/lustre_fsfilt.h               |   32 +-
 lustre/include/linux/lustre_intent.h               |   35 +
 lustre/include/linux/lustre_lite.h                 |    3 -
 lustre/include/linux/lustre_mds.h                  |   12 +-
 lustre/include/linux/lustre_patchless_compat.h     |   78 +
 lustre/include/linux/lustre_types.h                |    8 +-
 lustre/include/linux/lustre_user.h                 |    2 +
 lustre/include/linux/lvfs_linux.h                  |    1 +
 lustre/include/lprocfs_status.h                    |    6 +
 lustre/include/lustre/liblustreapi.h               |   42 +-
 lustre/include/lustre/lustre_idl.h                 |  249 +-
 lustre/include/lustre/lustre_user.h                |   27 +-
 lustre/include/lustre_cfg.h                        |    6 +-
 lustre/include/lustre_disk.h                       |   59 +-
 lustre/include/lustre_dlm.h                        |  169 +-
 lustre/include/lustre_export.h                     |    3 +
 lustre/include/lustre_import.h                     |   28 +-
 lustre/include/lustre_lib.h                        |   14 +-
 lustre/include/lustre_lite.h                       |    1 -
 lustre/include/lustre_mdc.h                        |    1 +
 lustre/include/lustre_mds.h                        |   19 +-
 lustre/include/lustre_net.h                        |  164 +-
 lustre/include/lustre_param.h                      |   29 +-
 lustre/include/lustre_req_layout.h                 |    1 +
 lustre/include/obd.h                               |  253 +-
 lustre/include/obd_class.h                         |  276 +-
 lustre/include/obd_ost.h                           |   12 +-
 lustre/include/obd_support.h                       |    9 +-
 .../kernel-2.4.21-rhel-2.4-i686-smp.config         |    6 +-
 .../kernel-2.4.21-rhel-2.4-i686.config             |    4 +
 .../kernel-2.4.21-rhel-2.4-ia64-smp.config         |    4 +
 .../kernel-2.4.21-rhel-2.4-ia64.config             |    4 +
 .../kernel-2.4.21-rhel-2.4-x86_64-smp.config       |    7 +-
 .../kernel-2.4.21-rhel-2.4-x86_64.config           |    6 +-
 .../kernel-2.4.21-suse-2.4.21-2-x86_64.config      |    4 +-
 .../kernel-2.6.15-2.6-fc5-i686-smp.config          | 1598 +++++++
 .../kernel-2.6.15-2.6-fc5-i686.config              | 1591 +++++++
 .../kernel_configs/kernel-2.6.15-fc5-i686.config   | 1598 +++++++
 .../kernel-2.6.16-2.6-patchless-i686-smp.config    | 1617 +++++++
 .../kernel-2.6.16-2.6-patchless-i686.config        | 1613 +++++++
 .../kernel-2.6.16-2.6-patchless-ia64-smp.config    | 1419 ++++++
 .../kernel-2.6.16-2.6-patchless-ia64.config        | 1416 ++++++
 .../kernel-2.6.16-2.6-patchless-x86_64-smp.config  | 1460 ++++++
 .../kernel-2.6.16-2.6-patchless-x86_64.config      | 1459 ++++++
 .../kernel-2.6.9-2.6-rhel4-i686-smp.config         |   49 +-
 .../kernel-2.6.9-2.6-rhel4-i686.config             |   48 +-
 .../kernel-2.6.9-2.6-rhel4-ia64-smp.config         |   79 +-
 .../kernel-2.6.9-2.6-rhel4-ia64.config             |   79 +-
 .../kernel-2.6.9-2.6-rhel4-x86_64-smp.config       |   51 +-
 .../kernel-2.6.9-2.6-rhel4-x86_64.config           |  106 +-
 .../patches/compile-fixes-2.4.21-rhel.patch        |  105 +-
 .../patches/dev_read_only-2.6-fc5.patch            |  145 +
 .../patches/dev_read_only-2.6.18-vanilla.patch     |  145 +
 lustre/kernel_patches/patches/export-2.6-fc5.patch |   24 +
 .../patches/export-2.6.18-vanilla.patch            |   24 +
 .../patches/export-show_task-2.6-fc5.patch         |   25 +
 .../patches/export-show_task-2.6.18-vanilla.patch  |   25 +
 .../patches/export-truncate-2.6.18-vanilla.patch   |   39 +
 .../patches/export_symbol_numa-2.6-fc5.patch       |   12 +
 .../patches/export_symbols-2.6.18-vanilla.patch    |   64 +
 .../patches/ext3-check-jbd-errors-2.6.5.patch      |  101 +
 .../patches/ext3-check-jbd-errors-2.6.9.patch      |  101 +
 .../patches/ext3-ea-in-inode-2.4.20.patch          |   14 +-
 .../patches/ext3-ea-in-inode-2.4.21-chaos.patch    |   14 +-
 .../patches/ext3-ea-in-inode-2.4.21-suse2.patch    |   14 +-
 .../patches/ext3-ea-in-inode-2.4.22-rh.patch       |   14 +-
 .../patches/ext3-ea-in-inode-2.4.29.patch          |   14 +-
 .../patches/ext3-ea-in-inode-2.6-rhel4.patch       |   10 +-
 .../patches/ext3-ea-in-inode-2.6-suse.patch        |   10 +-
 .../patches/ext3-extents-2.4.21-chaos.patch        |   25 +-
 .../patches/ext3-extents-2.4.21-suse2.patch        |   25 +-
 .../patches/ext3-extents-2.4.24.patch              |   25 +-
 .../patches/ext3-extents-2.4.29.patch              |   13 +-
 .../patches/ext3-extents-2.6.12.patch              |   34 +-
 .../patches/ext3-extents-2.6.15.patch              | 2933 ++++++++++++
 .../patches/ext3-extents-2.6.18-vanilla.patch      | 2935 ++++++++++++
 .../patches/ext3-extents-2.6.5.patch               |   34 +-
 .../patches/ext3-extents-2.6.9-rhel4.patch         |   34 +-
 .../patches/ext3-external-journal-2.6.9.patch      |  150 +
 .../patches/ext3-filterdata-2.6.15.patch           |   25 +
 .../patches/ext3-hash-selection.patch              |   55 +-
 .../patches/ext3-lookup-dotdot-2.4.20.patch        |   63 +
 .../patches/ext3-lookup-dotdot-2.6.9.patch         |   63 +
 .../patches/ext3-mballoc2-2.6-fc5.patch            | 2779 +++++++++++
 .../patches/ext3-mballoc2-2.6-suse.patch           |  552 ++-
 .../patches/ext3-mballoc2-2.6.12.patch             |  540 ++-
 .../patches/ext3-mballoc2-2.6.18-vanilla.patch     | 2810 ++++++++++++
 .../patches/ext3-mballoc2-2.6.9-rhel4.patch        |  556 ++-
 .../ext3-multi-mount-protection-2.6-fc5.patch      |  381 ++
 ...xt3-multi-mount-protection-2.6.18-vanilla.patch |  381 ++
 .../patches/ext3-sector_t-overflow-2.4.patch       |   41 +
 .../patches/ext3-sector_t-overflow-2.6.12.patch    |   64 +
 .../ext3-sector_t-overflow-2.6.5-suse.patch        |   44 +
 .../ext3-sector_t-overflow-2.6.9-rhel4.patch       |   64 +
 .../patches/ext3-wantedi-2.6-rhel4.patch           |    7 +-
 .../patches/ext3-wantedi-2.6-suse.patch            |    7 +-
 .../patches/ext3-wantedi-2.6.15.patch              |  174 +
 .../patches/ext3-wantedi-misc-2.6.18-vanilla.patch |   16 +
 lustre/kernel_patches/patches/iopen-2.6-fc5.patch  |  448 ++
 .../patches/iopen-misc-2.6.18-vanilla.patch        |   82 +
 .../patches/jbd-jcberr-2.6.18-vanilla.patch        |  228 +
 .../llnl-frame-pointer-walk-2.4.21-rhel.patch      |   27 +-
 .../llnl-frame-pointer-walk-fix-2.4.21-rhel.patch  |   68 +-
 .../patches/nfs-cifs-intent-2.6-fc5.patch          |  116 +
 .../patches/nfs-cifs-intent-2.6.18-vanilla.patch   |  120 +
 .../patches/raid5-configurable-cachesize.patch     |   50 +
 lustre/kernel_patches/patches/raid5-large-io.patch |   20 +
 .../kernel_patches/patches/raid5-merge-ios.patch   |  129 +
 .../patches/raid5-optimize-memcpy.patch            |  226 +
 .../patches/raid5-serialize-ovelapping-reqs.patch  |  140 +
 lustre/kernel_patches/patches/raid5-stats.patch    |  200 +
 .../patches/raid5-stripe-by-stripe-handling.patch  |  104 +
 .../patches/small_scatterlist-2.4.21-rhel.patch    | 4810 ++++++++++----------
 .../patches/tcp-zero-copy-2.6-fc5.patch            |  475 ++
 .../patches/tcp-zero-copy-2.6-sles10.patch         |  450 ++
 .../patches/tcp-zero-copy-2.6.18-vanilla.patch     |  450 ++
 .../patches/vfs_intent-2.4.21-rhel.patch           |  382 +-
 .../patches/vfs_intent-2.6-fc5-fix.patch           |   20 +
 .../patches/vfs_intent-2.6-fc5.patch               |  827 ++++
 .../patches/vfs_intent-2.6-sles10.patch            |  863 ++++
 .../patches/vfs_intent-2.6.18-vanilla.patch        |  824 ++++
 .../patches/vfs_nointent-2.6-fc5.patch             |  472 ++
 .../patches/vfs_nointent-2.6-sles10.patch          |  453 ++
 .../patches/vfs_nointent-2.6.18-vanilla.patch      |  451 ++
 .../patches/vfs_races-2.6.18-vanilla.patch         |   61 +
 lustre/kernel_patches/series/2.6-fc5.series        |   20 +
 .../kernel_patches/series/2.6-rhel4-titech.series  |   30 +
 lustre/kernel_patches/series/2.6-rhel4.series      |    7 +-
 lustre/kernel_patches/series/2.6-sles10.series     |   20 +
 lustre/kernel_patches/series/2.6.18-vanilla.series |   20 +
 lustre/kernel_patches/series/hp-pnnl-2.4.20        |    1 +
 .../kernel_patches/series/ldiskfs-2.6-fc3.series   |   22 +
 .../kernel_patches/series/ldiskfs-2.6-fc5.series   |   12 +
 .../kernel_patches/series/ldiskfs-2.6-rhel4.series |    3 +
 .../kernel_patches/series/ldiskfs-2.6-suse.series  |    3 +
 .../series/ldiskfs-2.6.12-vanilla.series           |    2 +
 .../series/ldiskfs-2.6.18-vanilla.series           |   13 +
 lustre/kernel_patches/series/rhel-2.4.21           |    2 +
 lustre/kernel_patches/series/suse-2.4.21-cray      |    1 +
 lustre/kernel_patches/series/suse-2.4.21-jvn       |   31 -
 lustre/kernel_patches/series/vanilla-2.4.24        |    1 +
 lustre/kernel_patches/series/vanilla-2.4.29        |    1 +
 lustre/kernel_patches/series/vanilla-2.4.29-uml    |    1 +
 lustre/kernel_patches/targets/2.6-fc5.target.in    |   18 +
 .../kernel_patches/targets/2.6-patchless.target.in |   25 +
 lustre/kernel_patches/targets/2.6-rhel4.target.in  |    2 +-
 lustre/kernel_patches/targets/2.6-suse.target.in   |    2 +-
 lustre/kernel_patches/targets/rhel-2.4.target.in   |    2 +-
 lustre/kernel_patches/which_patch                  |    1 -
 lustre/ldiskfs/lustre_quota_fmt.c                  |    5 +-
 lustre/ldlm/l_lock.c                               |  125 +-
 lustre/ldlm/ldlm_extent.c                          |   47 +-
 lustre/ldlm/ldlm_flock.c                           |   35 +-
 lustre/ldlm/ldlm_inodebits.c                       |   33 +-
 lustre/ldlm/ldlm_internal.h                        |   25 +-
 lustre/ldlm/ldlm_lib.c                             |  156 +-
 lustre/ldlm/ldlm_lock.c                            |  469 +-
 lustre/ldlm/ldlm_lockd.c                           |  463 +-
 lustre/ldlm/ldlm_plain.c                           |   31 +-
 lustre/ldlm/ldlm_request.c                         |  558 ++-
 lustre/ldlm/ldlm_resource.c                        |  367 +-
 lustre/liblustre/Makefile.am                       |    4 +-
 lustre/liblustre/dir.c                             |   12 +-
 lustre/liblustre/file.c                            |   65 +-
 lustre/liblustre/genlib.sh                         |    2 +
 lustre/liblustre/llite_lib.c                       |   57 +-
 lustre/liblustre/llite_lib.h                       |   25 +-
 lustre/liblustre/namei.c                           |   29 +-
 lustre/liblustre/rw.c                              |   84 +-
 lustre/liblustre/super.c                           |  283 +-
 lustre/liblustre/tests/Makefile.am                 |    2 +-
 lustre/liblustre/tests/echo_test.c                 |    1 +
 lustre/liblustre/tests/sanity.c                    |  182 +-
 lustre/llite/Makefile.in                           |    8 +-
 lustre/llite/autoMakefile.am                       |    4 +-
 lustre/llite/dcache.c                              |  280 +-
 lustre/llite/dir.c                                 |  274 +-
 lustre/llite/file.c                                |  764 +++-
 lustre/llite/llite_internal.h                      |   67 +-
 lustre/llite/llite_lib.c                           |  664 ++-
 lustre/llite/llite_mmap.c                          |    6 +
 lustre/llite/llite_nfs.c                           |    6 +-
 lustre/llite/lproc_llite.c                         |  133 +-
 lustre/llite/namei.c                               |  299 +-
 lustre/llite/rw.c                                  |   69 +-
 lustre/llite/rw24.c                                |   22 +-
 lustre/llite/rw26.c                                |  152 +-
 lustre/llite/super.c                               |    6 +
 lustre/llite/super25.c                             |    6 +-
 lustre/llite/symlink.c                             |   40 +-
 lustre/llite/xattr.c                               |   44 +-
 lustre/lmv/lmv_intent.c                            |   29 +-
 lustre/lmv/lmv_obd.c                               |   16 +-
 lustre/lov/lov_ea.c                                |   11 +-
 lustre/lov/lov_internal.h                          |   86 +-
 lustre/lov/lov_log.c                               |   57 +-
 lustre/lov/lov_merge.c                             |   23 +-
 lustre/lov/lov_obd.c                               | 1253 ++---
 lustre/lov/lov_pack.c                              |   15 +-
 lustre/lov/lov_qos.c                               |  723 ++-
 lustre/lov/lov_request.c                           |  776 +++-
 lustre/lov/lproc_lov.c                             |   36 +-
 lustre/lvfs/fsfilt_ext3.c                          |   76 +-
 lustre/lvfs/lvfs_linux.c                           |    8 +-
 lustre/lvfs/upcall_cache.c                         |    2 +-
 lustre/mdc/lproc_mdc.c                             |   36 +
 lustre/mdc/mdc_internal.h                          |   28 +-
 lustre/mdc/mdc_lib.c                               |   77 +-
 lustre/mdc/mdc_locks.c                             |  214 +-
 lustre/mdc/mdc_reint.c                             |   77 +-
 lustre/mdc/mdc_request.c                           |  343 +-
 lustre/mdd/mdd_lov.c                               |   15 +-
 lustre/mds/handler.c                               |  395 +-
 lustre/mds/mds_fs.c                                |   36 +-
 lustre/mds/mds_internal.h                          |   19 +-
 lustre/mds/mds_join.c                              |    7 +-
 lustre/mds/mds_lib.c                               |   85 +-
 lustre/mds/mds_log.c                               |    4 +-
 lustre/mds/mds_open.c                              |  264 +-
 lustre/mds/mds_reint.c                             |  267 +-
 lustre/mds/mds_unlink_open.c                       |   14 +-
 lustre/mds/mds_xattr.c                             |   53 +-
 lustre/mdt/mdt_fs.c                                |    2 +-
 lustre/mdt/mdt_handler.c                           |   53 +-
 lustre/mdt/mdt_lib.c                               |    8 +-
 lustre/mdt/mdt_open.c                              |    8 +-
 lustre/mdt/mdt_recovery.c                          |   12 +-
 lustre/mdt/mdt_reint.c                             |    2 +-
 lustre/mgc/autoMakefile.am                         |   11 +-
 lustre/mgc/libmgc.c                                |  126 +
 lustre/mgc/mgc_request.c                           |  107 +-
 lustre/mgs/mgs_fs.c                                |    2 +-
 lustre/mgs/mgs_handler.c                           |  145 +-
 lustre/mgs/mgs_internal.h                          |   21 +-
 lustre/mgs/mgs_llog.c                              |  892 ++--
 lustre/obdclass/class_obd.c                        |   62 +-
 lustre/obdclass/debug.c                            |    2 +-
 lustre/obdclass/genops.c                           |  233 +-
 lustre/obdclass/linux/linux-module.c               |   72 +-
 lustre/obdclass/linux/linux-obdo.c                 |   12 +-
 lustre/obdclass/llog_ioctl.c                       |    9 +-
 lustre/obdclass/llog_lvfs.c                        |    6 +-
 lustre/obdclass/llog_obd.c                         |   38 +-
 lustre/obdclass/lprocfs_status.c                   |  362 +-
 lustre/obdclass/lu_object.c                        |    4 +-
 lustre/obdclass/obd_config.c                       |  341 +-
 lustre/obdclass/obd_mount.c                        |  486 +-
 lustre/obdecho/echo.c                              |   41 +-
 lustre/obdecho/echo_client.c                       |   82 +-
 lustre/obdfilter/filter.c                          |  660 ++-
 lustre/obdfilter/filter_internal.h                 |   42 +-
 lustre/obdfilter/filter_io.c                       |   55 +-
 lustre/obdfilter/filter_io_24.c                    |   62 +-
 lustre/obdfilter/filter_io_26.c                    |  165 +-
 lustre/obdfilter/filter_log.c                      |   28 +-
 lustre/obdfilter/lproc_obdfilter.c                 |   85 +
 lustre/osc/lproc_osc.c                             |   22 +-
 lustre/osc/osc_create.c                            |   21 +-
 lustre/osc/osc_internal.h                          |   21 +-
 lustre/osc/osc_request.c                           | 1243 +++--
 lustre/ost/lproc_ost.c                             |    4 +-
 lustre/ost/ost_handler.c                           |  470 +-
 lustre/ptlrpc/Makefile.in                          |    2 +-
 lustre/ptlrpc/autoMakefile.am                      |    3 +-
 lustre/ptlrpc/client.c                             |  244 +-
 lustre/ptlrpc/events.c                             |   65 +-
 lustre/ptlrpc/import.c                             |  355 +-
 lustre/ptlrpc/layout.c                             |   52 +-
 lustre/ptlrpc/llog_client.c                        |   67 +-
 lustre/ptlrpc/llog_net.c                           |   13 +-
 lustre/ptlrpc/llog_server.c                        |   73 +-
 lustre/ptlrpc/lproc_ptlrpc.c                       |   28 +-
 lustre/ptlrpc/niobuf.c                             |   31 +-
 lustre/ptlrpc/pack_generic.c                       | 3423 ++++++--------
 lustre/ptlrpc/pinger.c                             |   38 +-
 lustre/ptlrpc/ptlrpc_internal.h                    |    1 +
 lustre/ptlrpc/ptlrpc_module.c                      |   28 +
 lustre/ptlrpc/ptlrpcd.c                            |    9 +-
 lustre/ptlrpc/recov_thread.c                       |   15 +-
 lustre/ptlrpc/recover.c                            |   42 +-
 lustre/ptlrpc/service.c                            |  196 +-
 lustre/ptlrpc/wirehdr.c                            |   10 +
 lustre/ptlrpc/wiretest.c                           |   14 +
 lustre/quota/quota_check.c                         |   29 +-
 lustre/quota/quota_context.c                       |   33 +-
 lustre/quota/quota_ctl.c                           |   23 +-
 lustre/quota/quota_interface.c                     |    2 +-
 lustre/tests/.cvsignore                            |    2 +
 lustre/tests/Makefile.am                           |    7 +-
 lustre/tests/acceptance-small.sh                   |   64 +-
 lustre/tests/cfg/insanity-local.sh                 |   32 +-
 lustre/tests/cfg/local.sh                          |   43 +-
 lustre/tests/cfg/lov.sh                            |   69 +
 lustre/tests/conf-sanity.sh                        |  130 +-
 lustre/tests/createdestroy.c                       |    6 +-
 lustre/tests/insanity.sh                           |   58 +-
 lustre/tests/lfscktest.sh                          |  226 +
 lustre/tests/ll_dirstripe_verify.c                 |   71 +-
 lustre/tests/llmount.sh                            |   45 +-
 lustre/tests/llmountcleanup.sh                     |   58 +-
 lustre/tests/llog-test.sh                          |    1 +
 lustre/tests/local.sh                              |   86 -
 lustre/tests/lov.sh                                |   79 -
 lustre/tests/mmap_sanity.c                         |    4 +-
 lustre/tests/mountconf.sh                          |   59 -
 lustre/tests/qos.sh                                |    8 +-
 lustre/tests/recovery-small.sh                     |  302 +-
 lustre/tests/replay-dual.sh                        |   37 +-
 lustre/tests/replay-ost-single.sh                  |   57 +-
 lustre/tests/replay-single.sh                      |   38 +-
 lustre/tests/rundbench                             |    4 +-
 lustre/tests/runtests                              |    9 +-
 lustre/tests/runvmstat                             |   18 +-
 lustre/tests/sanity-quota.sh                       |    2 +-
 lustre/tests/sanity.sh                             |  406 +-
 lustre/tests/sanityN.sh                            |   45 +-
 lustre/tests/small_write.c                         |   56 +-
 lustre/tests/stat.c                                |    1 +
 lustre/tests/test-framework.sh                     |  533 ++-
 lustre/tests/uml.sh                                |  133 -
 lustre/tests/utime.c                               |   85 +-
 lustre/tests/writemany.c                           |   12 +-
 lustre/utils/.cvsignore                            |    2 +
 lustre/utils/Makefile.am                           |   46 +-
 lustre/utils/cluster_scripts/1uml.csv              |    5 -
 lustre/utils/cluster_scripts/cluster_config.sh     |  705 ---
 .../utils/cluster_scripts/gen_clumanager_config.sh |  379 --
 lustre/utils/cluster_scripts/gen_hb_config.sh      |  591 ---
 lustre/utils/cluster_scripts/module_config.sh      |   61 -
 lustre/utils/cluster_scripts/verify_cluster_net.sh |  296 --
 lustre/utils/cluster_scripts/verify_serviceIP.sh   |  228 -
 lustre/utils/lactive                               |  120 -
 lustre/utils/lconf                                 |   32 +-
 lustre/utils/lfind                                 |    9 -
 lustre/utils/lfs.c                                 |  212 +-
 lustre/utils/liblustreapi.c                        |  680 ++-
 lustre/utils/llmount.c                             |  472 --
 lustre/utils/llobdstat.pl                          |    6 +-
 lustre/utils/llog_reader.c                         |  172 +-
 lustre/utils/llstat.pl                             |   36 +-
 lustre/utils/llverdev.c                            |  552 +++
 lustre/utils/llverfs.c                             |  650 +++
 lustre/utils/lmc                                   |    7 +-
 lustre/utils/lr_reader.c                           |  209 +
 lustre/utils/lstripe                               |    9 -
 lustre/utils/lustre_cfg.c                          |    8 -
 lustre/utils/mkfs_lustre.c                         |  418 +-
 lustre/utils/module_cleanup.sh                     |   22 +
 lustre/utils/module_setup.sh                       |   62 +-
 lustre/utils/mount_lustre.c                        |  193 +-
 lustre/utils/obd.c                                 |   22 +-
 lustre/utils/obdio.c                               |    4 +-
 lustre/utils/obdiolib.c                            |  162 +-
 lustre/utils/obdiolib.h                            |    9 +-
 lustre/utils/req-layout.c                          |    5 +-
 lustre/utils/wirecheck.c                           |  181 +-
 lustre/utils/wirehdr.c                             |    9 +-
 lustre/utils/wiretest.c                            | 1784 +-------
 419 files changed, 74973 insertions(+), 19890 deletions(-)
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.5.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.9.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-extents-2.6.15.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-filterdata-2.6.15.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-lookup-dotdot-2.6.9.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-fc5.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.18-vanilla.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.12.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.5-suse.patch
 create mode 100644 ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.9-rhel4.patch
 create mode 100644 ldiskfs/kernel_patches/patches/iopen-2.6-fc5.patch
 create mode 100644 ldiskfs/kernel_patches/series/ldiskfs-2.6-fc3.series
 create mode 100644 ldiskfs/kernel_patches/series/ldiskfs-2.6-fc5.series
 create mode 100644 ldiskfs/kernel_patches/series/ldiskfs-2.6.18-vanilla.series
 create mode 100644 lustre/contrib/.cvsignore
 create mode 100644 lustre/contrib/Makefile.am
 create mode 100644 lustre/contrib/README
 create mode 100644 lustre/contrib/mpich-1.2.6-lustre.patch
 create mode 100644 lustre/doc/llverdev.txt
 create mode 100644 lustre/doc/llverfs.txt
 create mode 100644 lustre/doc/lustre.7
 create mode 100644 lustre/doc/mkfs.lustre.8
 create mode 100644 lustre/doc/mount.lustre.8
 create mode 100644 lustre/doc/tunefs.lustre.8
 create mode 100644 lustre/include/linux/lustre_intent.h
 create mode 100644 lustre/include/linux/lustre_patchless_compat.h
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.15-2.6-fc5-i686-smp.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.15-2.6-fc5-i686.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.15-fc5-i686.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-i686-smp.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-i686.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-ia64-smp.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-ia64.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-x86_64-smp.config
 create mode 100644 lustre/kernel_patches/kernel_configs/kernel-2.6.16-2.6-patchless-x86_64.config
 create mode 100644 lustre/kernel_patches/patches/dev_read_only-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/dev_read_only-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/export-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/export-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/export-show_task-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/export-show_task-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/export-truncate-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/export_symbol_numa-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/export_symbols-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-check-jbd-errors-2.6.5.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-check-jbd-errors-2.6.9.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-extents-2.6.15.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-external-journal-2.6.9.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-filterdata-2.6.15.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-lookup-dotdot-2.4.20.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-lookup-dotdot-2.6.9.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-mballoc2-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-mballoc2-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-multi-mount-protection-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-multi-mount-protection-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-sector_t-overflow-2.4.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-sector_t-overflow-2.6.12.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-sector_t-overflow-2.6.5-suse.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-sector_t-overflow-2.6.9-rhel4.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-wantedi-2.6.15.patch
 create mode 100644 lustre/kernel_patches/patches/ext3-wantedi-misc-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/iopen-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/iopen-misc-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/jbd-jcberr-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/nfs-cifs-intent-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/nfs-cifs-intent-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-configurable-cachesize.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-large-io.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-merge-ios.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-optimize-memcpy.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-serialize-ovelapping-reqs.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-stats.patch
 create mode 100644 lustre/kernel_patches/patches/raid5-stripe-by-stripe-handling.patch
 create mode 100644 lustre/kernel_patches/patches/tcp-zero-copy-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/tcp-zero-copy-2.6-sles10.patch
 create mode 100644 lustre/kernel_patches/patches/tcp-zero-copy-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_intent-2.6-fc5-fix.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_intent-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_intent-2.6-sles10.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_intent-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_nointent-2.6-fc5.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_nointent-2.6-sles10.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_nointent-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/patches/vfs_races-2.6.18-vanilla.patch
 create mode 100644 lustre/kernel_patches/series/2.6-fc5.series
 create mode 100644 lustre/kernel_patches/series/2.6-rhel4-titech.series
 create mode 100644 lustre/kernel_patches/series/2.6-sles10.series
 create mode 100644 lustre/kernel_patches/series/2.6.18-vanilla.series
 create mode 100644 lustre/kernel_patches/series/ldiskfs-2.6-fc3.series
 create mode 100644 lustre/kernel_patches/series/ldiskfs-2.6-fc5.series
 create mode 100644 lustre/kernel_patches/series/ldiskfs-2.6.18-vanilla.series
 delete mode 100644 lustre/kernel_patches/series/suse-2.4.21-jvn
 create mode 100644 lustre/kernel_patches/targets/2.6-fc5.target.in
 create mode 100644 lustre/kernel_patches/targets/2.6-patchless.target.in
 create mode 100644 lustre/mgc/libmgc.c
 create mode 100644 lustre/ptlrpc/wirehdr.c
 create mode 100644 lustre/ptlrpc/wiretest.c
 create mode 100644 lustre/tests/cfg/lov.sh
 create mode 100755 lustre/tests/lfscktest.sh
 delete mode 100755 lustre/tests/local.sh
 delete mode 100755 lustre/tests/lov.sh
 delete mode 100755 lustre/tests/mountconf.sh
 delete mode 100644 lustre/tests/uml.sh
 delete mode 100644 lustre/utils/cluster_scripts/1uml.csv
 delete mode 100755 lustre/utils/cluster_scripts/cluster_config.sh
 delete mode 100755 lustre/utils/cluster_scripts/gen_clumanager_config.sh
 delete mode 100755 lustre/utils/cluster_scripts/gen_hb_config.sh
 delete mode 100755 lustre/utils/cluster_scripts/module_config.sh
 delete mode 100755 lustre/utils/cluster_scripts/verify_cluster_net.sh
 delete mode 100755 lustre/utils/cluster_scripts/verify_serviceIP.sh
 delete mode 100644 lustre/utils/lactive
 delete mode 100755 lustre/utils/lfind
 delete mode 100644 lustre/utils/llmount.c
 create mode 100644 lustre/utils/llverdev.c
 create mode 100644 lustre/utils/llverfs.c
 create mode 100644 lustre/utils/lr_reader.c
 delete mode 100755 lustre/utils/lstripe
 create mode 100755 lustre/utils/module_cleanup.sh

diff --git a/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.5.patch b/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.5.patch
new file mode 100644
index 0000000..dca4676
--- /dev/null
+++ b/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.5.patch
@@ -0,0 +1,101 @@
+Index: linux-2.6.5-7.201/fs/ext3/super.c
+===================================================================
+--- linux-2.6.5-7.201.orig/fs/ext3/super.c	2006-06-20 19:40:44.000000000 +0400
++++ linux-2.6.5-7.201/fs/ext3/super.c	2006-06-20 19:42:08.000000000 +0400
+@@ -39,7 +39,7 @@
+ static int ext3_load_journal(struct super_block *, struct ext3_super_block *);
+ static int ext3_create_journal(struct super_block *, struct ext3_super_block *,
+ 			       int);
+-static void ext3_commit_super (struct super_block * sb,
++void ext3_commit_super (struct super_block * sb,
+ 			       struct ext3_super_block * es,
+ 			       int sync);
+ static void ext3_mark_recovery_complete(struct super_block * sb,
+@@ -1781,7 +1781,7 @@ static int ext3_create_journal(struct su
+ 	return 0;
+ }
+ 
+-static void ext3_commit_super (struct super_block * sb,
++void ext3_commit_super (struct super_block * sb,
+ 			       struct ext3_super_block * es,
+ 			       int sync)
+ {
+Index: linux-2.6.5-7.201/fs/ext3/namei.c
+===================================================================
+--- linux-2.6.5-7.201.orig/fs/ext3/namei.c	2006-06-20 19:40:44.000000000 +0400
++++ linux-2.6.5-7.201/fs/ext3/namei.c	2006-06-20 19:42:08.000000000 +0400
+@@ -1598,7 +1598,7 @@ static int ext3_delete_entry (handle_t *
+ 			      struct buffer_head * bh)
+ {
+ 	struct ext3_dir_entry_2 * de, * pde;
+-	int i;
++	int i, err;
+ 
+ 	i = 0;
+ 	pde = NULL;
+@@ -1608,7 +1608,9 @@ static int ext3_delete_entry (handle_t *
+ 			return -EIO;
+ 		if (de == de_del) {
+ 			BUFFER_TRACE(bh, "get_write_access");
+-			ext3_journal_get_write_access(handle, bh);
++			err = ext3_journal_get_write_access(handle, bh);
++			if (err)
++				return err;
+ 			if (pde)
+ 				pde->rec_len =
+ 					cpu_to_le16(le16_to_cpu(pde->rec_len) +
+Index: linux-2.6.5-7.201/fs/ext3/xattr.c
+===================================================================
+--- linux-2.6.5-7.201.orig/fs/ext3/xattr.c	2006-06-20 19:40:44.000000000 +0400
++++ linux-2.6.5-7.201/fs/ext3/xattr.c	2006-06-20 19:42:30.000000000 +0400
+@@ -107,7 +107,7 @@ ext3_xattr_register(int name_index, stru
+ {
+ 	int error = -EINVAL;
+ 
+-	if (name_index > 0 && name_index <= EXT3_XATTR_INDEX_MAX) {
++	if (name_index > 0 && name_index < EXT3_XATTR_INDEX_MAX) {
+ 		write_lock(&ext3_handler_lock);
+ 		if (!ext3_xattr_handlers[name_index-1]) {
+ 			ext3_xattr_handlers[name_index-1] = handler;
+Index: linux-2.6.5-7.201/fs/ext3/inode.c
+===================================================================
+--- linux-2.6.5-7.201.orig/fs/ext3/inode.c	2006-06-20 19:40:44.000000000 +0400
++++ linux-2.6.5-7.201/fs/ext3/inode.c	2006-06-20 19:42:08.000000000 +0400
+@@ -1517,9 +1517,14 @@ out_stop:
+ 	if (end > inode->i_size) {
+ 		ei->i_disksize = end;
+ 		i_size_write(inode, end);
+-		err = ext3_mark_inode_dirty(handle, inode);
+-		if (!ret)
+-			ret = err;
++		/*
++		 * We're going to return a positive `ret'
++		 * here due to non-zero-length I/O, so there's
++		 * no way of reporting error returns from
++		 * ext3_mark_inode_dirty() to userspace.  So
++		 * ignore it.
++		 */
++		ext3_mark_inode_dirty(handle, inode);
+ 	}
+ }
+ err = ext3_journal_stop(handle);
+@@ -1811,8 +1816,18 @@ ext3_clear_blocks(handle_t *handle, stru
+ 		ext3_mark_inode_dirty(handle, inode);
+ 		ext3_journal_test_restart(handle, inode);
+ 		if (bh) {
++			int err;
+ 			BUFFER_TRACE(bh, "retaking write access");
+-			ext3_journal_get_write_access(handle, bh);
++			err = ext3_journal_get_write_access(handle, bh);
++			if (err) {
++				struct super_block *sb = inode->i_sb;
++				struct ext3_super_block *es = EXT3_SB(sb)->s_es;
++				printk (KERN_CRIT"EXT3-fs: can't continue truncate\n");
++				EXT3_SB(sb)->s_mount_state |= EXT3_ERROR_FS;
++				es->s_state |= cpu_to_le16(EXT3_ERROR_FS);
++				ext3_commit_super(sb, es, 1);
++				return;
++			}
+ 		}
+ 	}
+ 
diff --git a/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.9.patch b/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.9.patch
new file mode 100644
index 0000000..df3d2ea
--- /dev/null
+++ b/ldiskfs/kernel_patches/patches/ext3-check-jbd-errors-2.6.9.patch
@@ -0,0 +1,101 @@
+Index: linux-2.6.9-full/fs/ext3/super.c
+===================================================================
+--- linux-2.6.9-full.orig/fs/ext3/super.c	2006-06-02 23:37:51.000000000 +0400
++++ linux-2.6.9-full/fs/ext3/super.c	2006-06-02 23:56:29.000000000 +0400
+@@ -43,7 +43,7 @@ static int ext3_load_journal(struct supe
+ 			     unsigned long journal_devnum);
+ static int ext3_create_journal(struct super_block *, struct ext3_super_block *,
+ 			       int);
+-static void ext3_commit_super (struct super_block * sb,
++void ext3_commit_super (struct super_block * sb,
+ 			       struct ext3_super_block * es,
+ 			       int sync);
+ static void ext3_mark_recovery_complete(struct super_block * sb,
+@@ -1991,7 +1991,7 @@ static int ext3_create_journal(struct su
+ 	return 0;
+ }
+ 
+-static void ext3_commit_super (struct super_block * sb,
++void ext3_commit_super (struct super_block * sb,
+ 			       struct ext3_super_block * es,
+ 			       int sync)
+ {
+Index: linux-2.6.9-full/fs/ext3/namei.c
+===================================================================
+--- linux-2.6.9-full.orig/fs/ext3/namei.c	2006-06-02 23:37:49.000000000 +0400
++++ linux-2.6.9-full/fs/ext3/namei.c	2006-06-02 23:43:31.000000000 +0400
+@@ -1599,7 +1599,7 @@ static int ext3_delete_entry (handle_t *
+ 			      struct buffer_head * bh)
+ {
+ 	struct ext3_dir_entry_2 * de, * pde;
+-	int i;
++	int i, err;
+ 
+ 	i = 0;
+ 	pde = NULL;
+@@ -1609,7 +1609,9 @@ static int ext3_delete_entry (handle_t *
+ 			return -EIO;
+ 		if (de == de_del) {
+ 			BUFFER_TRACE(bh, "get_write_access");
+-			ext3_journal_get_write_access(handle, bh);
++			err = ext3_journal_get_write_access(handle, bh);
++			if (err)
++				return err;
+ 			if (pde)
+ 				pde->rec_len =
+ 					cpu_to_le16(le16_to_cpu(pde->rec_len) +
+Index: linux-2.6.9-full/fs/ext3/xattr.c
+===================================================================
+--- linux-2.6.9-full.orig/fs/ext3/xattr.c	2006-06-01 14:58:48.000000000 +0400
++++ linux-2.6.9-full/fs/ext3/xattr.c	2006-06-03 00:02:00.000000000 +0400
+@@ -132,7 +132,7 @@ ext3_xattr_handler(int name_index)
+ {
+ 	struct xattr_handler *handler = NULL;
+ 
+-	if (name_index > 0 && name_index <= EXT3_XATTR_INDEX_MAX)
++	if (name_index > 0 && name_index < EXT3_XATTR_INDEX_MAX)
+ 		handler = ext3_xattr_handler_map[name_index];
+ 	return handler;
+ }
+Index: linux-2.6.9-full/fs/ext3/inode.c
+===================================================================
+--- linux-2.6.9-full.orig/fs/ext3/inode.c	2006-06-02 23:37:38.000000000 +0400
++++ linux-2.6.9-full/fs/ext3/inode.c	2006-06-03 00:27:41.000000000 +0400
+@@ -1513,9 +1513,14 @@ out_stop:
+ 	if (end > inode->i_size) {
+ 		ei->i_disksize = end;
+ 		i_size_write(inode, end);
+-		err = ext3_mark_inode_dirty(handle, inode);
+-		if (!ret)
+-			ret = err;
++		/*
++		 * We're going to return a positive `ret'
++		 * here due to non-zero-length I/O, so there's
++		 * no way of reporting error returns from
++		 * ext3_mark_inode_dirty() to userspace.  So
++		 * ignore it.
++		 */
++		ext3_mark_inode_dirty(handle, inode);
+ 	}
+ }
+ err = ext3_journal_stop(handle);
+@@ -1807,8 +1812,18 @@ ext3_clear_blocks(handle_t *handle, stru
+ 		ext3_mark_inode_dirty(handle, inode);
+ 		ext3_journal_test_restart(handle, inode);
+ 		if (bh) {
++			int err;
+ 			BUFFER_TRACE(bh, "retaking write access");
+-			ext3_journal_get_write_access(handle, bh);
++			err = ext3_journal_get_write_access(handle, bh);
++			if (err) {
++				struct super_block *sb = inode->i_sb;
++				struct ext3_super_block *es = EXT3_SB(sb)->s_es;
++				printk (KERN_CRIT"EXT3-fs: can't continue truncate\n");
++				EXT3_SB(sb)->s_mount_state |= EXT3_ERROR_FS;
++				es->s_state |= cpu_to_le16(EXT3_ERROR_FS);
++				ext3_commit_super(sb, es, 1);
++				return;
++			}
+ 		}
+ 	}
+ 
diff --git a/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-rhel4.patch b/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-rhel4.patch
index 3f5687b..89cc1b5 100644
--- a/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-rhel4.patch
+++ b/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-rhel4.patch
@@ -2,15 +2,13 @@ Index: linux-stage/fs/ext3/ialloc.c
 ===================================================================
 --- linux-stage.orig/fs/ext3/ialloc.c	2005-10-04 16:53:24.000000000 -0600
 +++ linux-stage/fs/ext3/ialloc.c	2005-10-04 17:07:25.000000000 -0600
-@@ -629,6 +629,11 @@
+@@ -629,6 +629,9 @@
  	spin_unlock(&sbi->s_next_gen_lock);
  	ei->i_state = EXT3_STATE_NEW;
-+	if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
-+		ei->i_extra_isize = sizeof(__u16) /* i_extra_isize */
-+				    + sizeof(__u16); /* i_pad1 */
-+	} else
-+		ei->i_extra_isize = 0;
++	ei->i_extra_isize =
++		(EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
++		sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE : 0;
  
  	ret = inode;
  	if(DQUOT_ALLOC_INODE(inode)) {
diff --git a/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-suse.patch b/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-suse.patch
index 19f153d..72c25a4 100644
--- a/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-suse.patch
+++ b/ldiskfs/kernel_patches/patches/ext3-ea-in-inode-2.6-suse.patch
@@ -3,15 +3,13 @@ Index: linux-2.6.0/fs/ext3/ialloc.c
 ===================================================================
 --- linux-2.6.0.orig/fs/ext3/ialloc.c	2004-01-14 18:54:11.000000000 +0300
 +++ linux-2.6.0/fs/ext3/ialloc.c	2004-01-14 18:54:12.000000000 +0300
-@@ -627,6 +627,11 @@
+@@ -627,6 +627,9 @@
  	inode->i_generation = EXT3_SB(sb)->s_next_generation++;
  	ei->i_state = EXT3_STATE_NEW;
-+	if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
-+		ei->i_extra_isize = sizeof(__u16) /* i_extra_isize */
-+				    + sizeof(__u16); /* i_pad1 */
-+	} else
-+		ei->i_extra_isize = 0;
++	ei->i_extra_isize =
++		(EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
++ sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE : 0; ret = inode; if(DQUOT_ALLOC_INODE(inode)) { diff --git a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.12.patch b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.12.patch index b6439e6..f421f88 100644 --- a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.12.patch +++ b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.12.patch @@ -2,7 +2,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c =================================================================== --- linux-2.6.12-rc6.orig/fs/ext3/extents.c 2005-06-14 16:31:25.756503133 +0200 +++ linux-2.6.12-rc6/fs/ext3/extents.c 2005-06-14 16:31:25.836581257 +0200 -@@ -0,0 +1,2353 @@ +@@ -0,0 +1,2359 @@ +/* + * Copyright(c) 2003, 2004, 2005, Cluster File Systems, Inc, info@clusterfs.com + * Written by Alex Tomas @@ -178,7 +178,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c +{ + struct ext3_extent_header *neh = EXT_ROOT_HDR(tree); + neh->eh_generation = ((EXT_FLAGS(neh) & ~EXT_FLAGS_CLR_UNKNOWN) << 24) | -+ (EXT_GENERATION(neh) + 1); ++ (EXT_HDR_GEN(neh) + 1); +} + +static inline int ext3_ext_space_block(struct ext3_extents_tree *tree) @@ -560,6 +560,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + + ix->ei_block = logical; + ix->ei_leaf = ptr; ++ ix->ei_leaf_hi = ix->ei_unused = 0; + curp->p_hdr->eh_entries++; + + EXT_ASSERT(curp->p_hdr->eh_entries <= curp->p_hdr->eh_max); @@ -722,6 +723,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + fidx = EXT_FIRST_INDEX(neh); + fidx->ei_block = border; + fidx->ei_leaf = oldblock; ++ fidx->ei_leaf_hi = fidx->ei_unused = 0; + + ext_debug(tree, "int.index at %d (block %lu): %lu -> %lu\n", + i, newblock, border, oldblock); @@ -855,6 +857,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + /* FIXME: it works, but actually path[0] can be index */ + curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; + curp->p_idx->ei_leaf = newblock; ++ curp->p_idx->ei_leaf_hi = curp->p_idx->ei_unused = 0; + + neh = EXT_ROOT_HDR(tree); + 
fidx = EXT_FIRST_INDEX(neh); @@ -1403,6 +1406,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) { + ex->ee_block = cex->ec_block; + ex->ee_start = cex->ec_start; ++ ex->ee_start_hi = 0; + ex->ee_len = cex->ec_len; + ext_debug(tree, "%lu cached by %lu:%lu:%lu\n", + (unsigned long) block, @@ -1624,7 +1628,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + + if (num == 0) { + /* this extent is removed entirely mark slot unused */ -+ ex->ee_start = 0; ++ ex->ee_start = ex->ee_start_hi = 0; + eh->eh_entries--; + fu = ex; + } @@ -1646,7 +1650,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + while (lu < le) { + if (lu->ee_start) { + *fu = *lu; -+ lu->ee_start = 0; ++ lu->ee_start = lu->ee_start_hi = 0; + fu++; + } + lu++; @@ -2001,6 +2005,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + /* allocate new block for the extent */ + goal = ext3_ext_find_goal(inode, path, ex->ee_block); + ex->ee_start = ext3_new_block(handle, inode, goal, err); ++ ex->ee_start_hi = 0; + if (ex->ee_start == 0) { + /* error occured: restore old extent */ + ex->ee_start = newblock; @@ -2116,6 +2121,7 @@ Index: linux-2.6.12-rc6/fs/ext3/extents.c + /* try to insert new extent into found leaf and return */ + newex.ee_block = iblock; + newex.ee_start = newblock; ++ newex.ee_start_hi = 0; + newex.ee_len = 1; + err = ext3_ext_insert_extent(handle, &tree, path, &newex); + if (err) @@ -2523,26 +2529,30 @@ Index: linux-2.6.12-rc6/fs/ext3/super.c Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -+ Opt_extents, Opt_extdebug, ++ Opt_extents, Opt_noextents, Opt_extdebug, }; static match_table_t tokens = { -@@ -644,6 +647,8 @@ +@@ -644,6 +647,9 @@ {Opt_iopen, "iopen"}, {Opt_noiopen, "noiopen"}, {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_extents, "extents"}, ++ {Opt_noextents, "noextents"}, + {Opt_extdebug, "extdebug"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL}, 
{Opt_resize, "resize"}, -@@ -953,6 +958,12 @@ +@@ -953,6 +958,15 @@ case Opt_nobh: set_opt(sbi->s_mount_opt, NOBH); break; + case Opt_extents: + set_opt (sbi->s_mount_opt, EXTENTS); + break; ++ case Opt_noextents: ++ clear_opt (sbi->s_mount_opt, EXTENTS); ++ break; + case Opt_extdebug: + set_opt (sbi->s_mount_opt, EXTDEBUG); + break; @@ -2621,11 +2631,13 @@ Index: linux-2.6.12-rc6/include/linux/ext3_fs.h #define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \ EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \ EXT3_FEATURE_RO_COMPAT_BTREE_DIR) -@@ -759,6 +767,7 @@ +@@ -759,6 +767,9 @@ /* inode.c */ -+extern int ext3_block_truncate_page(handle_t *, struct page *, struct address_space *, loff_t); ++extern int ext3_block_truncate_page(handle_t *, struct page *, ++ struct address_space *, loff_t); ++extern int ext3_writepage_trans_blocks(struct inode *inode); extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int); extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); @@ -2849,14 +2861,14 @@ Index: linux-2.6.12-rc6/include/linux/ext3_extents.h + (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_max - 1) +#define EXT_MAX_INDEX(__hdr__) \ + (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_max - 1) -+#define EXT_GENERATION(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) ++#define EXT_HDR_GEN(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) +#define EXT_FLAGS(__hdr__) ((__hdr__)->eh_generation >> 24) +#define EXT_FLAGS_CLR_UNKNOWN 0x7 /* Flags cleared on modification */ + +#define EXT_BLOCK_HDR(__bh__) ((struct ext3_extent_header *)(__bh__)->b_data) +#define EXT_ROOT_HDR(__tree__) ((struct ext3_extent_header *)(__tree__)->root) +#define EXT_DEPTH(__tree__) (EXT_ROOT_HDR(__tree__)->eh_depth) -+ ++#define EXT_GENERATION(__tree__) EXT_HDR_GEN(EXT_ROOT_HDR(__tree__)) + +#define EXT_ASSERT(__x__) if (!(__x__)) BUG(); + diff --git 
a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.15.patch b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.15.patch new file mode 100644 index 0000000..3e18d55 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.15.patch @@ -0,0 +1,2933 @@ +Index: linux-2.6.16.21-0.8/fs/ext3/extents.c +=================================================================== +--- /dev/null ++++ linux-2.6.16.21-0.8/fs/ext3/extents.c +@@ -0,0 +1,2347 @@ ++/* ++ * Copyright(c) 2003, 2004, 2005, Cluster File Systems, Inc, info@clusterfs.com ++ * Written by Alex Tomas ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License version 2 as ++ * published by the Free Software Foundation. ++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. 
++ * ++ * You should have received a copy of the GNU General Public Licens ++ * along with this program; if not, write to the Free Software ++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- ++ */ ++ ++/* ++ * Extents support for EXT3 ++ * ++ * TODO: ++ * - ext3_ext_walk_space() sould not use ext3_ext_find_extent() ++ * - ext3_ext_calc_credits() could take 'mergable' into account ++ * - ext3*_error() should be used in some situations ++ * - find_goal() [to be tested and improved] ++ * - smart tree reduction ++ * - arch-independence ++ * common on-disk format for big/little-endian arch ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++ ++static inline int ext3_ext_check_header(struct ext3_extent_header *eh) ++{ ++ if (eh->eh_magic != EXT3_EXT_MAGIC) { ++ printk(KERN_ERR "EXT3-fs: invalid magic = 0x%x\n", ++ (unsigned)eh->eh_magic); ++ return -EIO; ++ } ++ if (eh->eh_max == 0) { ++ printk(KERN_ERR "EXT3-fs: invalid eh_max = %u\n", ++ (unsigned)eh->eh_max); ++ return -EIO; ++ } ++ if (eh->eh_entries > eh->eh_max) { ++ printk(KERN_ERR "EXT3-fs: invalid eh_entries = %u\n", ++ (unsigned)eh->eh_entries); ++ return -EIO; ++ } ++ return 0; ++} ++ ++static handle_t *ext3_ext_journal_restart(handle_t *handle, int needed) ++{ ++ int err; ++ ++ if (handle->h_buffer_credits > needed) ++ return handle; ++ if (!ext3_journal_extend(handle, needed)) ++ return handle; ++ err = ext3_journal_restart(handle, needed); ++ ++ return handle; ++} ++ ++static int inline ++ext3_ext_get_access_for_root(handle_t *h, struct ext3_extents_tree *tree) ++{ ++ if (tree->ops->get_write_access) ++ return tree->ops->get_write_access(h,tree->buffer); ++ else ++ return 0; ++} ++ ++static int inline ++ext3_ext_mark_root_dirty(handle_t *h, struct ext3_extents_tree *tree) ++{ ++ if (tree->ops->mark_buffer_dirty) ++ return tree->ops->mark_buffer_dirty(h,tree->buffer); ++ else ++ return 
0; ++} ++ ++/* ++ * could return: ++ * - EROFS ++ * - ENOMEM ++ */ ++static int ext3_ext_get_access(handle_t *handle, ++ struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int err; ++ ++ if (path->p_bh) { ++ /* path points to block */ ++ err = ext3_journal_get_write_access(handle, path->p_bh); ++ } else { ++ /* path points to leaf/index in inode body */ ++ err = ext3_ext_get_access_for_root(handle, tree); ++ } ++ return err; ++} ++ ++/* ++ * could return: ++ * - EROFS ++ * - ENOMEM ++ * - EIO ++ */ ++static int ext3_ext_dirty(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int err; ++ if (path->p_bh) { ++ /* path points to block */ ++ err =ext3_journal_dirty_metadata(handle, path->p_bh); ++ } else { ++ /* path points to leaf/index in inode body */ ++ err = ext3_ext_mark_root_dirty(handle, tree); ++ } ++ return err; ++} ++ ++static int inline ++ext3_ext_new_block(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, struct ext3_extent *ex, ++ int *err) ++{ ++ int goal, depth, newblock; ++ struct inode *inode; ++ ++ EXT_ASSERT(tree); ++ if (tree->ops->new_block) ++ return tree->ops->new_block(handle, tree, path, ex, err); ++ ++ inode = tree->inode; ++ depth = EXT_DEPTH(tree); ++ if (path && depth > 0) { ++ goal = path[depth-1].p_block; ++ } else { ++ struct ext3_inode_info *ei = EXT3_I(inode); ++ unsigned long bg_start; ++ unsigned long colour; ++ ++ bg_start = (ei->i_block_group * ++ EXT3_BLOCKS_PER_GROUP(inode->i_sb)) + ++ le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block); ++ colour = (current->pid % 16) * ++ (EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16); ++ goal = bg_start + colour; ++ } ++ ++ newblock = ext3_new_block(handle, inode, goal, err); ++ return newblock; ++} ++ ++static inline void ext3_ext_tree_changed(struct ext3_extents_tree *tree) ++{ ++ struct ext3_extent_header *neh; ++ neh = EXT_ROOT_HDR(tree); ++ neh->eh_generation++; ++} ++ ++static inline int 
ext3_ext_space_block(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->inode->i_sb->s_blocksize - ++ sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent); ++#ifdef AGRESSIVE_TEST ++ size = 6; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_block_idx(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->inode->i_sb->s_blocksize - ++ sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent_idx); ++#ifdef AGRESSIVE_TEST ++ size = 5; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_root(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->buffer_len - sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent); ++#ifdef AGRESSIVE_TEST ++ size = 3; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_root_idx(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->buffer_len - sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent_idx); ++#ifdef AGRESSIVE_TEST ++ size = 4; ++#endif ++ return size; ++} ++ ++static void ext3_ext_show_path(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++#ifdef EXT_DEBUG ++ int k, l = path->p_depth; ++ ++ ext_debug(tree, "path:"); ++ for (k = 0; k <= l; k++, path++) { ++ if (path->p_idx) { ++ ext_debug(tree, " %d->%d", path->p_idx->ei_block, ++ path->p_idx->ei_leaf); ++ } else if (path->p_ext) { ++ ext_debug(tree, " %d:%d:%d", ++ path->p_ext->ee_block, ++ path->p_ext->ee_len, ++ path->p_ext->ee_start); ++ } else ++ ext_debug(tree, " []"); ++ } ++ ext_debug(tree, "\n"); ++#endif ++} ++ ++static void ext3_ext_show_leaf(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++#ifdef EXT_DEBUG ++ int depth = EXT_DEPTH(tree); ++ struct ext3_extent_header *eh; ++ struct ext3_extent *ex; ++ int i; ++ ++ if (!path) ++ return; ++ ++ eh = path[depth].p_hdr; ++ ex = EXT_FIRST_EXTENT(eh); ++ ++ for (i = 0; i < eh->eh_entries; i++, ex++) { ++ ext_debug(tree, "%d:%d:%d 
", ++ ex->ee_block, ex->ee_len, ex->ee_start); ++ } ++ ext_debug(tree, "\n"); ++#endif ++} ++ ++static void ext3_ext_drop_refs(struct ext3_ext_path *path) ++{ ++ int depth = path->p_depth; ++ int i; ++ ++ for (i = 0; i <= depth; i++, path++) { ++ if (path->p_bh) { ++ brelse(path->p_bh); ++ path->p_bh = NULL; ++ } ++ } ++} ++ ++/* ++ * binary search for closest index by given block ++ */ ++static inline void ++ext3_ext_binsearch_idx(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, int block) ++{ ++ struct ext3_extent_header *eh = path->p_hdr; ++ struct ext3_extent_idx *ix; ++ int l = 0, k, r; ++ ++ EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC); ++ EXT_ASSERT(eh->eh_entries <= eh->eh_max); ++ EXT_ASSERT(eh->eh_entries > 0); ++ ++ ext_debug(tree, "binsearch for %d(idx): ", block); ++ ++ path->p_idx = ix = EXT_FIRST_INDEX(eh); ++ ++ r = k = eh->eh_entries; ++ while (k > 1) { ++ k = (r - l) / 2; ++ if (block < ix[l + k].ei_block) ++ r -= k; ++ else ++ l += k; ++ ext_debug(tree, "%d:%d:%d ", k, l, r); ++ } ++ ++ ix += l; ++ path->p_idx = ix; ++ ext_debug(tree," -> %d->%d ",path->p_idx->ei_block,path->p_idx->ei_leaf); ++ ++ while (l++ < r) { ++ if (block < ix->ei_block) ++ break; ++ path->p_idx = ix++; ++ } ++ ext_debug(tree, " -> %d->%d\n", path->p_idx->ei_block, ++ path->p_idx->ei_leaf); ++ ++#ifdef CHECK_BINSEARCH ++ { ++ struct ext3_extent_idx *chix; ++ ++ chix = ix = EXT_FIRST_INDEX(eh); ++ for (k = 0; k < eh->eh_entries; k++, ix++) { ++ if (k != 0 && ix->ei_block <= ix[-1].ei_block) { ++ printk("k=%d, ix=0x%p, first=0x%p\n", k, ++ ix, EXT_FIRST_INDEX(eh)); ++ printk("%u <= %u\n", ++ ix->ei_block,ix[-1].ei_block); ++ } ++ EXT_ASSERT(k == 0 || ix->ei_block > ix[-1].ei_block); ++ if (block < ix->ei_block) ++ break; ++ chix = ix; ++ } ++ EXT_ASSERT(chix == path->p_idx); ++ } ++#endif ++} ++ ++/* ++ * binary search for closest extent by given block ++ */ ++static inline void ++ext3_ext_binsearch(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, 
int block) ++{ ++ struct ext3_extent_header *eh = path->p_hdr; ++ struct ext3_extent *ex; ++ int l = 0, k, r; ++ ++ EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC); ++ EXT_ASSERT(eh->eh_entries <= eh->eh_max); ++ ++ if (eh->eh_entries == 0) { ++ /* ++ * this leaf is empty yet: ++ * we get such a leaf in split/add case ++ */ ++ return; ++ } ++ ++ ext_debug(tree, "binsearch for %d: ", block); ++ ++ path->p_ext = ex = EXT_FIRST_EXTENT(eh); ++ ++ r = k = eh->eh_entries; ++ while (k > 1) { ++ k = (r - l) / 2; ++ if (block < ex[l + k].ee_block) ++ r -= k; ++ else ++ l += k; ++ ext_debug(tree, "%d:%d:%d ", k, l, r); ++ } ++ ++ ex += l; ++ path->p_ext = ex; ++ ext_debug(tree, " -> %d:%d:%d ", path->p_ext->ee_block, ++ path->p_ext->ee_start, path->p_ext->ee_len); ++ ++ while (l++ < r) { ++ if (block < ex->ee_block) ++ break; ++ path->p_ext = ex++; ++ } ++ ext_debug(tree, " -> %d:%d:%d\n", path->p_ext->ee_block, ++ path->p_ext->ee_start, path->p_ext->ee_len); ++ ++#ifdef CHECK_BINSEARCH ++ { ++ struct ext3_extent *chex; ++ ++ chex = ex = EXT_FIRST_EXTENT(eh); ++ for (k = 0; k < eh->eh_entries; k++, ex++) { ++ EXT_ASSERT(k == 0 || ex->ee_block > ex[-1].ee_block); ++ if (block < ex->ee_block) ++ break; ++ chex = ex; ++ } ++ EXT_ASSERT(chex == path->p_ext); ++ } ++#endif ++} ++ ++int ext3_extent_tree_init(handle_t *handle, struct ext3_extents_tree *tree) ++{ ++ struct ext3_extent_header *eh; ++ ++ BUG_ON(tree->buffer_len == 0); ++ ext3_ext_get_access_for_root(handle, tree); ++ eh = EXT_ROOT_HDR(tree); ++ eh->eh_depth = 0; ++ eh->eh_entries = 0; ++ eh->eh_magic = EXT3_EXT_MAGIC; ++ eh->eh_max = ext3_ext_space_root(tree); ++ ext3_ext_mark_root_dirty(handle, tree); ++ ext3_ext_invalidate_cache(tree); ++ return 0; ++} ++ ++struct ext3_ext_path * ++ext3_ext_find_extent(struct ext3_extents_tree *tree, int block, ++ struct ext3_ext_path *path) ++{ ++ struct ext3_extent_header *eh; ++ struct buffer_head *bh; ++ int depth, i, ppos = 0; ++ ++ EXT_ASSERT(tree); ++ EXT_ASSERT(tree->inode); ++ 
EXT_ASSERT(tree->root); ++ ++ eh = EXT_ROOT_HDR(tree); ++ EXT_ASSERT(eh); ++ if (ext3_ext_check_header(eh)) ++ goto err; ++ ++ i = depth = EXT_DEPTH(tree); ++ EXT_ASSERT(eh->eh_max); ++ EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC); ++ ++ /* account possible depth increase */ ++ if (!path) { ++ path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 2), ++ GFP_NOFS); ++ if (!path) ++ return ERR_PTR(-ENOMEM); ++ } ++ memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1)); ++ path[0].p_hdr = eh; ++ ++ /* walk through the tree */ ++ while (i) { ++ ext_debug(tree, "depth %d: num %d, max %d\n", ++ ppos, eh->eh_entries, eh->eh_max); ++ ext3_ext_binsearch_idx(tree, path + ppos, block); ++ path[ppos].p_block = path[ppos].p_idx->ei_leaf; ++ path[ppos].p_depth = i; ++ path[ppos].p_ext = NULL; ++ ++ bh = sb_bread(tree->inode->i_sb, path[ppos].p_block); ++ if (!bh) ++ goto err; ++ ++ eh = EXT_BLOCK_HDR(bh); ++ ppos++; ++ EXT_ASSERT(ppos <= depth); ++ path[ppos].p_bh = bh; ++ path[ppos].p_hdr = eh; ++ i--; ++ ++ if (ext3_ext_check_header(eh)) ++ goto err; ++ } ++ ++ path[ppos].p_depth = i; ++ path[ppos].p_hdr = eh; ++ path[ppos].p_ext = NULL; ++ path[ppos].p_idx = NULL; ++ ++ if (ext3_ext_check_header(eh)) ++ goto err; ++ ++ /* find extent */ ++ ext3_ext_binsearch(tree, path + ppos, block); ++ ++ ext3_ext_show_path(tree, path); ++ ++ return path; ++ ++err: ++ printk(KERN_ERR "EXT3-fs: header is corrupted!\n"); ++ ext3_ext_drop_refs(path); ++ kfree(path); ++ return ERR_PTR(-EIO); ++} ++ ++/* ++ * insert new index [logical;ptr] into the block at cupr ++ * it check where to insert: before curp or after curp ++ */ ++static int ext3_ext_insert_index(handle_t *handle, ++ struct ext3_extents_tree *tree, ++ struct ext3_ext_path *curp, ++ int logical, int ptr) ++{ ++ struct ext3_extent_idx *ix; ++ int len, err; ++ ++ if ((err = ext3_ext_get_access(handle, tree, curp))) ++ return err; ++ ++ EXT_ASSERT(logical != curp->p_idx->ei_block); ++ len = EXT_MAX_INDEX(curp->p_hdr) - curp->p_idx; 
++ if (logical > curp->p_idx->ei_block) { ++ /* insert after */ ++ if (curp->p_idx != EXT_LAST_INDEX(curp->p_hdr)) { ++ len = (len - 1) * sizeof(struct ext3_extent_idx); ++ len = len < 0 ? 0 : len; ++ ext_debug(tree, "insert new index %d after: %d. " ++ "move %d from 0x%p to 0x%p\n", ++ logical, ptr, len, ++ (curp->p_idx + 1), (curp->p_idx + 2)); ++ memmove(curp->p_idx + 2, curp->p_idx + 1, len); ++ } ++ ix = curp->p_idx + 1; ++ } else { ++ /* insert before */ ++ len = len * sizeof(struct ext3_extent_idx); ++ len = len < 0 ? 0 : len; ++ ext_debug(tree, "insert new index %d before: %d. " ++ "move %d from 0x%p to 0x%p\n", ++ logical, ptr, len, ++ curp->p_idx, (curp->p_idx + 1)); ++ memmove(curp->p_idx + 1, curp->p_idx, len); ++ ix = curp->p_idx; ++ } ++ ++ ix->ei_block = logical; ++ ix->ei_leaf = ptr; ++ curp->p_hdr->eh_entries++; ++ ++ EXT_ASSERT(curp->p_hdr->eh_entries <= curp->p_hdr->eh_max); ++ EXT_ASSERT(ix <= EXT_LAST_INDEX(curp->p_hdr)); ++ ++ err = ext3_ext_dirty(handle, tree, curp); ++ ext3_std_error(tree->inode->i_sb, err); ++ ++ return err; ++} ++ ++/* ++ * routine inserts new subtree into the path, using free index entry ++ * at depth 'at: ++ * - allocates all needed blocks (new leaf and all intermediate index blocks) ++ * - makes decision where to split ++ * - moves remaining extens and index entries (right to the split point) ++ * into the newly allocated blocks ++ * - initialize subtree ++ */ ++static int ext3_ext_split(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *newext, int at) ++{ ++ struct buffer_head *bh = NULL; ++ int depth = EXT_DEPTH(tree); ++ struct ext3_extent_header *neh; ++ struct ext3_extent_idx *fidx; ++ struct ext3_extent *ex; ++ int i = at, k, m, a; ++ unsigned long newblock, oldblock, border; ++ int *ablocks = NULL; /* array of allocated blocks */ ++ int err = 0; ++ ++ /* make decision: where to split? 
*/ ++ /* FIXME: now desicion is simplest: at current extent */ ++ ++ /* if current leaf will be splitted, then we should use ++ * border from split point */ ++ EXT_ASSERT(path[depth].p_ext <= EXT_MAX_EXTENT(path[depth].p_hdr)); ++ if (path[depth].p_ext != EXT_MAX_EXTENT(path[depth].p_hdr)) { ++ border = path[depth].p_ext[1].ee_block; ++ ext_debug(tree, "leaf will be splitted." ++ " next leaf starts at %d\n", ++ (int)border); ++ } else { ++ border = newext->ee_block; ++ ext_debug(tree, "leaf will be added." ++ " next leaf starts at %d\n", ++ (int)border); ++ } ++ ++ /* ++ * if error occurs, then we break processing ++ * and turn filesystem read-only. so, index won't ++ * be inserted and tree will be in consistent ++ * state. next mount will repair buffers too ++ */ ++ ++ /* ++ * get array to track all allocated blocks ++ * we need this to handle errors and free blocks ++ * upon them ++ */ ++ ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS); ++ if (!ablocks) ++ return -ENOMEM; ++ memset(ablocks, 0, sizeof(unsigned long) * depth); ++ ++ /* allocate all needed blocks */ ++ ext_debug(tree, "allocate %d blocks for indexes/leaf\n", depth - at); ++ for (a = 0; a < depth - at; a++) { ++ newblock = ext3_ext_new_block(handle, tree, path, newext, &err); ++ if (newblock == 0) ++ goto cleanup; ++ ablocks[a] = newblock; ++ } ++ ++ /* initialize new leaf */ ++ newblock = ablocks[--a]; ++ EXT_ASSERT(newblock); ++ bh = sb_getblk(tree->inode->i_sb, newblock); ++ if (!bh) { ++ err = -EIO; ++ goto cleanup; ++ } ++ lock_buffer(bh); ++ ++ if ((err = ext3_journal_get_create_access(handle, bh))) ++ goto cleanup; ++ ++ neh = EXT_BLOCK_HDR(bh); ++ neh->eh_entries = 0; ++ neh->eh_max = ext3_ext_space_block(tree); ++ neh->eh_magic = EXT3_EXT_MAGIC; ++ neh->eh_depth = 0; ++ ex = EXT_FIRST_EXTENT(neh); ++ ++ /* move remain of path[depth] to the new leaf */ ++ EXT_ASSERT(path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max); ++ /* start copy from next extent */ ++ /* TODO: we 
could do it by single memmove */ ++ m = 0; ++ path[depth].p_ext++; ++ while (path[depth].p_ext <= ++ EXT_MAX_EXTENT(path[depth].p_hdr)) { ++ ext_debug(tree, "move %d:%d:%d in new leaf %lu\n", ++ path[depth].p_ext->ee_block, ++ path[depth].p_ext->ee_start, ++ path[depth].p_ext->ee_len, ++ newblock); ++ memmove(ex++, path[depth].p_ext++, sizeof(struct ext3_extent)); ++ neh->eh_entries++; ++ m++; ++ } ++ set_buffer_uptodate(bh); ++ unlock_buffer(bh); ++ ++ if ((err = ext3_journal_dirty_metadata(handle, bh))) ++ goto cleanup; ++ brelse(bh); ++ bh = NULL; ++ ++ /* correct old leaf */ ++ if (m) { ++ if ((err = ext3_ext_get_access(handle, tree, path + depth))) ++ goto cleanup; ++ path[depth].p_hdr->eh_entries -= m; ++ if ((err = ext3_ext_dirty(handle, tree, path + depth))) ++ goto cleanup; ++ ++ } ++ ++ /* create intermediate indexes */ ++ k = depth - at - 1; ++ EXT_ASSERT(k >= 0); ++ if (k) ++ ext_debug(tree, "create %d intermediate indices\n", k); ++ /* insert new index into current index block */ ++ /* current depth stored in i var */ ++ i = depth - 1; ++ while (k--) { ++ oldblock = newblock; ++ newblock = ablocks[--a]; ++ bh = sb_getblk(tree->inode->i_sb, newblock); ++ if (!bh) { ++ err = -EIO; ++ goto cleanup; ++ } ++ lock_buffer(bh); ++ ++ if ((err = ext3_journal_get_create_access(handle, bh))) ++ goto cleanup; ++ ++ neh = EXT_BLOCK_HDR(bh); ++ neh->eh_entries = 1; ++ neh->eh_magic = EXT3_EXT_MAGIC; ++ neh->eh_max = ext3_ext_space_block_idx(tree); ++ neh->eh_depth = depth - i; ++ fidx = EXT_FIRST_INDEX(neh); ++ fidx->ei_block = border; ++ fidx->ei_leaf = oldblock; ++ ++ ext_debug(tree, "int.index at %d (block %lu): %lu -> %lu\n", ++ i, newblock, border, oldblock); ++ /* copy indexes */ ++ m = 0; ++ path[i].p_idx++; ++ ++ ext_debug(tree, "cur 0x%p, last 0x%p\n", path[i].p_idx, ++ EXT_MAX_INDEX(path[i].p_hdr)); ++ EXT_ASSERT(EXT_MAX_INDEX(path[i].p_hdr) == ++ EXT_LAST_INDEX(path[i].p_hdr)); ++ while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) { ++ ext_debug(tree, 
"%d: move %d:%d in new index %lu\n", ++ i, path[i].p_idx->ei_block, ++ path[i].p_idx->ei_leaf, newblock); ++ memmove(++fidx, path[i].p_idx++, ++ sizeof(struct ext3_extent_idx)); ++ neh->eh_entries++; ++ EXT_ASSERT(neh->eh_entries <= neh->eh_max); ++ m++; ++ } ++ set_buffer_uptodate(bh); ++ unlock_buffer(bh); ++ ++ if ((err = ext3_journal_dirty_metadata(handle, bh))) ++ goto cleanup; ++ brelse(bh); ++ bh = NULL; ++ ++ /* correct old index */ ++ if (m) { ++ err = ext3_ext_get_access(handle, tree, path + i); ++ if (err) ++ goto cleanup; ++ path[i].p_hdr->eh_entries -= m; ++ err = ext3_ext_dirty(handle, tree, path + i); ++ if (err) ++ goto cleanup; ++ } ++ ++ i--; ++ } ++ ++ /* insert new index */ ++ if (!err) ++ err = ext3_ext_insert_index(handle, tree, path + at, ++ border, newblock); ++ ++cleanup: ++ if (bh) { ++ if (buffer_locked(bh)) ++ unlock_buffer(bh); ++ brelse(bh); ++ } ++ ++ if (err) { ++ /* free all allocated blocks in error case */ ++ for (i = 0; i < depth; i++) { ++ if (!ablocks[i]) ++ continue; ++ ext3_free_blocks(handle, tree->inode, ablocks[i], 1); ++ } ++ } ++ kfree(ablocks); ++ ++ return err; ++} ++ ++/* ++ * routine implements tree growing procedure: ++ * - allocates new block ++ * - moves top-level data (index block or leaf) into the new block ++ * - initialize new top-level, creating index that points to the ++ * just created block ++ */ ++static int ext3_ext_grow_indepth(handle_t *handle, ++ struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *newext) ++{ ++ struct ext3_ext_path *curp = path; ++ struct ext3_extent_header *neh; ++ struct ext3_extent_idx *fidx; ++ struct buffer_head *bh; ++ unsigned long newblock; ++ int err = 0; ++ ++ newblock = ext3_ext_new_block(handle, tree, path, newext, &err); ++ if (newblock == 0) ++ return err; ++ ++ bh = sb_getblk(tree->inode->i_sb, newblock); ++ if (!bh) { ++ err = -EIO; ++ ext3_std_error(tree->inode->i_sb, err); ++ return err; ++ } ++ lock_buffer(bh); ++ ++ if ((err = 
ext3_journal_get_create_access(handle, bh))) { ++ unlock_buffer(bh); ++ goto out; ++ } ++ ++ /* move top-level index/leaf into new block */ ++ memmove(bh->b_data, curp->p_hdr, tree->buffer_len); ++ ++ /* set size of new block */ ++ neh = EXT_BLOCK_HDR(bh); ++ /* old root could have indexes or leaves ++ * so calculate eh_max right way */ ++ if (EXT_DEPTH(tree)) ++ neh->eh_max = ext3_ext_space_block_idx(tree); ++ else ++ neh->eh_max = ext3_ext_space_block(tree); ++ neh->eh_magic = EXT3_EXT_MAGIC; ++ set_buffer_uptodate(bh); ++ unlock_buffer(bh); ++ ++ if ((err = ext3_journal_dirty_metadata(handle, bh))) ++ goto out; ++ ++ /* create index in new top-level index: num,max,pointer */ ++ if ((err = ext3_ext_get_access(handle, tree, curp))) ++ goto out; ++ ++ curp->p_hdr->eh_magic = EXT3_EXT_MAGIC; ++ curp->p_hdr->eh_max = ext3_ext_space_root_idx(tree); ++ curp->p_hdr->eh_entries = 1; ++ curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr); ++ /* FIXME: it works, but actually path[0] can be index */ ++ curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; ++ curp->p_idx->ei_leaf = newblock; ++ ++ neh = EXT_ROOT_HDR(tree); ++ fidx = EXT_FIRST_INDEX(neh); ++ ext_debug(tree, "new root: num %d(%d), lblock %d, ptr %d\n", ++ neh->eh_entries, neh->eh_max, fidx->ei_block, fidx->ei_leaf); ++ ++ neh->eh_depth = path->p_depth + 1; ++ err = ext3_ext_dirty(handle, tree, curp); ++out: ++ brelse(bh); ++ ++ return err; ++} ++ ++/* ++ * routine finds empty index and adds new leaf. 
if no free index found ++ * then it requests in-depth growing ++ */ ++static int ext3_ext_create_new_leaf(handle_t *handle, ++ struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *newext) ++{ ++ struct ext3_ext_path *curp; ++ int depth, i, err = 0; ++ ++repeat: ++ i = depth = EXT_DEPTH(tree); ++ ++ /* walk up to the tree and look for free index entry */ ++ curp = path + depth; ++ while (i > 0 && !EXT_HAS_FREE_INDEX(curp)) { ++ i--; ++ curp--; ++ } ++ ++ /* we use already allocated block for index block ++ * so, subsequent data blocks should be contigoues */ ++ if (EXT_HAS_FREE_INDEX(curp)) { ++ /* if we found index with free entry, then use that ++ * entry: create all needed subtree and add new leaf */ ++ err = ext3_ext_split(handle, tree, path, newext, i); ++ ++ /* refill path */ ++ ext3_ext_drop_refs(path); ++ path = ext3_ext_find_extent(tree, newext->ee_block, path); ++ if (IS_ERR(path)) ++ err = PTR_ERR(path); ++ } else { ++ /* tree is full, time to grow in depth */ ++ err = ext3_ext_grow_indepth(handle, tree, path, newext); ++ ++ /* refill path */ ++ ext3_ext_drop_refs(path); ++ path = ext3_ext_find_extent(tree, newext->ee_block, path); ++ if (IS_ERR(path)) ++ err = PTR_ERR(path); ++ ++ /* ++ * only first (depth 0 -> 1) produces free space ++ * in all other cases we have to split growed tree ++ */ ++ depth = EXT_DEPTH(tree); ++ if (path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max) { ++ /* now we need split */ ++ goto repeat; ++ } ++ } ++ ++ if (err) ++ return err; ++ ++ return 0; ++} ++ ++/* ++ * returns allocated block in subsequent extent or EXT_MAX_BLOCK ++ * NOTE: it consider block number from index entry as ++ * allocated block. 
thus, index entries have to be consistent ++ * with leaves ++ */ ++static unsigned long ++ext3_ext_next_allocated_block(struct ext3_ext_path *path) ++{ ++ int depth; ++ ++ EXT_ASSERT(path != NULL); ++ depth = path->p_depth; ++ ++ if (depth == 0 && path->p_ext == NULL) ++ return EXT_MAX_BLOCK; ++ ++ /* FIXME: what if the index isn't full?! */ ++ while (depth >= 0) { ++ if (depth == path->p_depth) { ++ /* leaf */ ++ if (path[depth].p_ext != ++ EXT_LAST_EXTENT(path[depth].p_hdr)) ++ return path[depth].p_ext[1].ee_block; ++ } else { ++ /* index */ ++ if (path[depth].p_idx != ++ EXT_LAST_INDEX(path[depth].p_hdr)) ++ return path[depth].p_idx[1].ei_block; ++ } ++ depth--; ++ } ++ ++ return EXT_MAX_BLOCK; ++} ++ ++/* ++ * returns first allocated block from next leaf or EXT_MAX_BLOCK ++ */ ++static unsigned ext3_ext_next_leaf_block(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int depth; ++ ++ EXT_ASSERT(path != NULL); ++ depth = path->p_depth; ++ ++ /* zero-tree has no leaf blocks at all */ ++ if (depth == 0) ++ return EXT_MAX_BLOCK; ++ ++ /* go to index block */ ++ depth--; ++ ++ while (depth >= 0) { ++ if (path[depth].p_idx != ++ EXT_LAST_INDEX(path[depth].p_hdr)) ++ return path[depth].p_idx[1].ei_block; ++ depth--; ++ } ++ ++ return EXT_MAX_BLOCK; ++} ++ ++/* ++ * if a leaf gets modified and the modified extent is first in the leaf, ++ * then we have to correct all indexes above ++ * TODO: do we need to correct the tree in all cases?
++ */ ++int ext3_ext_correct_indexes(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ struct ext3_extent_header *eh; ++ int depth = EXT_DEPTH(tree); ++ struct ext3_extent *ex; ++ unsigned long border; ++ int k, err = 0; ++ ++ eh = path[depth].p_hdr; ++ ex = path[depth].p_ext; ++ EXT_ASSERT(ex); ++ EXT_ASSERT(eh); ++ ++ if (depth == 0) { ++ /* there is no tree at all */ ++ return 0; ++ } ++ ++ if (ex != EXT_FIRST_EXTENT(eh)) { ++ /* we only correct the tree if the first extent in the leaf was modified */ ++ return 0; ++ } ++ ++ /* ++ * TODO: we need correction if border is smaller than current one ++ */ ++ k = depth - 1; ++ border = path[depth].p_ext->ee_block; ++ if ((err = ext3_ext_get_access(handle, tree, path + k))) ++ return err; ++ path[k].p_idx->ei_block = border; ++ if ((err = ext3_ext_dirty(handle, tree, path + k))) ++ return err; ++ ++ while (k--) { ++ /* change all left-side indexes */ ++ if (path[k+1].p_idx != EXT_FIRST_INDEX(path[k+1].p_hdr)) ++ break; ++ if ((err = ext3_ext_get_access(handle, tree, path + k))) ++ break; ++ path[k].p_idx->ei_block = border; ++ if ((err = ext3_ext_dirty(handle, tree, path + k))) ++ break; ++ } ++ ++ return err; ++} ++ ++static int inline ++ext3_can_extents_be_merged(struct ext3_extents_tree *tree, ++ struct ext3_extent *ex1, ++ struct ext3_extent *ex2) ++{ ++ if (ex1->ee_block + ex1->ee_len != ex2->ee_block) ++ return 0; ++ ++#ifdef AGRESSIVE_TEST ++ if (ex1->ee_len >= 4) ++ return 0; ++#endif ++ ++ if (!tree->ops->mergable) ++ return 1; ++ ++ return tree->ops->mergable(ex1, ex2); ++} ++ ++/* ++ * this routine tries to merge the requested extent into an existing ++ * extent or inserts the requested extent as a new one into the tree, ++ * creating a new leaf in the no-space case ++ */ ++int ext3_ext_insert_extent(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *newext) ++{ ++ struct ext3_extent_header *eh; ++ struct ext3_extent *ex, *fex; ++ struct ext3_extent
*nearex; /* nearest extent */ ++ struct ext3_ext_path *npath = NULL; ++ int depth, len, err, next; ++ ++ EXT_ASSERT(newext->ee_len > 0); ++ depth = EXT_DEPTH(tree); ++ ex = path[depth].p_ext; ++ EXT_ASSERT(path[depth].p_hdr); ++ ++ /* try to insert block into found extent and return */ ++ if (ex && ext3_can_extents_be_merged(tree, ex, newext)) { ++ ext_debug(tree, "append %d block to %d:%d (from %d)\n", ++ newext->ee_len, ex->ee_block, ex->ee_len, ++ ex->ee_start); ++ if ((err = ext3_ext_get_access(handle, tree, path + depth))) ++ return err; ++ ex->ee_len += newext->ee_len; ++ eh = path[depth].p_hdr; ++ nearex = ex; ++ goto merge; ++ } ++ ++repeat: ++ depth = EXT_DEPTH(tree); ++ eh = path[depth].p_hdr; ++ if (eh->eh_entries < eh->eh_max) ++ goto has_space; ++ ++ /* probably next leaf has space for us? */ ++ fex = EXT_LAST_EXTENT(eh); ++ next = ext3_ext_next_leaf_block(tree, path); ++ if (newext->ee_block > fex->ee_block && next != EXT_MAX_BLOCK) { ++ ext_debug(tree, "next leaf block - %d\n", next); ++ EXT_ASSERT(!npath); ++ npath = ext3_ext_find_extent(tree, next, NULL); ++ if (IS_ERR(npath)) ++ return PTR_ERR(npath); ++ EXT_ASSERT(npath->p_depth == path->p_depth); ++ eh = npath[depth].p_hdr; ++ if (eh->eh_entries < eh->eh_max) { ++ ext_debug(tree, "next leaf isn't full(%d)\n", ++ eh->eh_entries); ++ path = npath; ++ goto repeat; ++ } ++ ext_debug(tree, "next leaf has no free space(%d,%d)\n", ++ eh->eh_entries, eh->eh_max); ++ } ++ ++ /* ++ * there is no free space in the found leaf, ++ * so we're going to add a new leaf to the tree ++ */ ++ err = ext3_ext_create_new_leaf(handle, tree, path, newext); ++ if (err) ++ goto cleanup; ++ depth = EXT_DEPTH(tree); ++ eh = path[depth].p_hdr; ++ ++has_space: ++ nearex = path[depth].p_ext; ++ ++ if ((err = ext3_ext_get_access(handle, tree, path + depth))) ++ goto cleanup; ++ ++ if (!nearex) { ++ /* there is no extent in this leaf, create the first one */ ++ ext_debug(tree, "first extent in the leaf: %d:%d:%d\n", ++ newext->ee_block,
newext->ee_start, ++ newext->ee_len); ++ path[depth].p_ext = EXT_FIRST_EXTENT(eh); ++ } else if (newext->ee_block > nearex->ee_block) { ++ EXT_ASSERT(newext->ee_block != nearex->ee_block); ++ if (nearex != EXT_LAST_EXTENT(eh)) { ++ len = EXT_MAX_EXTENT(eh) - nearex; ++ len = (len - 1) * sizeof(struct ext3_extent); ++ len = len < 0 ? 0 : len; ++ ext_debug(tree, "insert %d:%d:%d after: nearest 0x%p, " ++ "move %d from 0x%p to 0x%p\n", ++ newext->ee_block, newext->ee_start, ++ newext->ee_len, ++ nearex, len, nearex + 1, nearex + 2); ++ memmove(nearex + 2, nearex + 1, len); ++ } ++ path[depth].p_ext = nearex + 1; ++ } else { ++ EXT_ASSERT(newext->ee_block != nearex->ee_block); ++ len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent); ++ len = len < 0 ? 0 : len; ++ ext_debug(tree, "insert %d:%d:%d before: nearest 0x%p, " ++ "move %d from 0x%p to 0x%p\n", ++ newext->ee_block, newext->ee_start, newext->ee_len, ++ nearex, len, nearex + 1, nearex + 2); ++ memmove(nearex + 1, nearex, len); ++ path[depth].p_ext = nearex; ++ } ++ ++ eh->eh_entries++; ++ nearex = path[depth].p_ext; ++ nearex->ee_block = newext->ee_block; ++ nearex->ee_start = newext->ee_start; ++ nearex->ee_len = newext->ee_len; ++ /* FIXME: support for large fs */ ++ nearex->ee_start_hi = 0; ++ ++merge: ++ /* try to merge extents to the right */ ++ while (nearex < EXT_LAST_EXTENT(eh)) { ++ if (!ext3_can_extents_be_merged(tree, nearex, nearex + 1)) ++ break; ++ /* merge with next extent! 
*/ ++ nearex->ee_len += nearex[1].ee_len; ++ if (nearex + 1 < EXT_LAST_EXTENT(eh)) { ++ len = (EXT_LAST_EXTENT(eh) - nearex - 1) * ++ sizeof(struct ext3_extent); ++ memmove(nearex + 1, nearex + 2, len); ++ } ++ eh->eh_entries--; ++ EXT_ASSERT(eh->eh_entries > 0); ++ } ++ ++ /* try to merge extents to the left */ ++ ++ /* time to correct all indexes above */ ++ err = ext3_ext_correct_indexes(handle, tree, path); ++ if (err) ++ goto cleanup; ++ ++ err = ext3_ext_dirty(handle, tree, path + depth); ++ ++cleanup: ++ if (npath) { ++ ext3_ext_drop_refs(npath); ++ kfree(npath); ++ } ++ ext3_ext_tree_changed(tree); ++ ext3_ext_invalidate_cache(tree); ++ return err; ++} ++ ++int ext3_ext_walk_space(struct ext3_extents_tree *tree, unsigned long block, ++ unsigned long num, ext_prepare_callback func) ++{ ++ struct ext3_ext_path *path = NULL; ++ struct ext3_ext_cache cbex; ++ struct ext3_extent *ex; ++ unsigned long next, start = 0, end = 0; ++ unsigned long last = block + num; ++ int depth, exists, err = 0; ++ ++ EXT_ASSERT(tree); ++ EXT_ASSERT(func); ++ EXT_ASSERT(tree->inode); ++ EXT_ASSERT(tree->root); ++ ++ while (block < last && block != EXT_MAX_BLOCK) { ++ num = last - block; ++ /* find extent for this block */ ++ path = ext3_ext_find_extent(tree, block, path); ++ if (IS_ERR(path)) { ++ err = PTR_ERR(path); ++ path = NULL; ++ break; ++ } ++ ++ depth = EXT_DEPTH(tree); ++ EXT_ASSERT(path[depth].p_hdr); ++ ex = path[depth].p_ext; ++ next = ext3_ext_next_allocated_block(path); ++ ++ exists = 0; ++ if (!ex) { ++ /* there is no extent yet, so try to allocate ++ * all requested space */ ++ start = block; ++ end = block + num; ++ } else if (ex->ee_block > block) { ++ /* need to allocate space before found extent */ ++ start = block; ++ end = ex->ee_block; ++ if (block + num < end) ++ end = block + num; ++ } else if (block >= ex->ee_block + ex->ee_len) { ++ /* need to allocate space after found extent */ ++ start = block; ++ end = block + num; ++ if (end >= next) ++ end = next; 
++ } else if (block >= ex->ee_block) { ++ /* ++ * some part of requested space is covered ++ * by found extent ++ */ ++ start = block; ++ end = ex->ee_block + ex->ee_len; ++ if (block + num < end) ++ end = block + num; ++ exists = 1; ++ } else { ++ BUG(); ++ } ++ EXT_ASSERT(end > start); ++ ++ if (!exists) { ++ cbex.ec_block = start; ++ cbex.ec_len = end - start; ++ cbex.ec_start = 0; ++ cbex.ec_type = EXT3_EXT_CACHE_GAP; ++ } else { ++ cbex.ec_block = ex->ee_block; ++ cbex.ec_len = ex->ee_len; ++ cbex.ec_start = ex->ee_start; ++ cbex.ec_type = EXT3_EXT_CACHE_EXTENT; ++ } ++ ++ EXT_ASSERT(cbex.ec_len > 0); ++ EXT_ASSERT(path[depth].p_hdr); ++ err = func(tree, path, &cbex); ++ ext3_ext_drop_refs(path); ++ ++ if (err < 0) ++ break; ++ if (err == EXT_REPEAT) ++ continue; ++ else if (err == EXT_BREAK) { ++ err = 0; ++ break; ++ } ++ ++ if (EXT_DEPTH(tree) != depth) { ++ /* depth was changed, we have to realloc path */ ++ kfree(path); ++ path = NULL; ++ } ++ ++ block = cbex.ec_block + cbex.ec_len; ++ } ++ ++ if (path) { ++ ext3_ext_drop_refs(path); ++ kfree(path); ++ } ++ ++ return err; ++} ++ ++static inline void ++ext3_ext_put_in_cache(struct ext3_extents_tree *tree, __u32 block, ++ __u32 len, __u32 start, int type) ++{ ++ EXT_ASSERT(len > 0); ++ if (tree->cex) { ++ tree->cex->ec_type = type; ++ tree->cex->ec_block = block; ++ tree->cex->ec_len = len; ++ tree->cex->ec_start = start; ++ } ++} ++ ++/* ++ * this routine calculates the boundaries of the gap the requested block ++ * fits into and caches this gap ++ */ ++static inline void ++ext3_ext_put_gap_in_cache(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ unsigned long block) ++{ ++ int depth = EXT_DEPTH(tree); ++ unsigned long lblock, len; ++ struct ext3_extent *ex; ++ ++ if (!tree->cex) ++ return; ++ ++ ex = path[depth].p_ext; ++ if (ex == NULL) { ++ /* there is no extent yet, so gap is [0;-] */ ++ lblock = 0; ++ len = EXT_MAX_BLOCK; ++ ext_debug(tree, "cache gap(whole file):"); ++ } else if (block <
ex->ee_block) { ++ lblock = block; ++ len = ex->ee_block - block; ++ ext_debug(tree, "cache gap(before): %lu [%lu:%lu]", ++ (unsigned long) block, ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len); ++ } else if (block >= ex->ee_block + ex->ee_len) { ++ lblock = ex->ee_block + ex->ee_len; ++ len = ext3_ext_next_allocated_block(path); ++ ext_debug(tree, "cache gap(after): [%lu:%lu] %lu", ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len, ++ (unsigned long) block); ++ EXT_ASSERT(len > lblock); ++ len = len - lblock; ++ } else { ++ lblock = len = 0; ++ BUG(); ++ } ++ ++ ext_debug(tree, " -> %lu:%lu\n", (unsigned long) lblock, len); ++ ext3_ext_put_in_cache(tree, lblock, len, 0, EXT3_EXT_CACHE_GAP); ++} ++ ++static inline int ++ext3_ext_in_cache(struct ext3_extents_tree *tree, unsigned long block, ++ struct ext3_extent *ex) ++{ ++ struct ext3_ext_cache *cex = tree->cex; ++ ++ /* is there cache storage at all? */ ++ if (!cex) ++ return EXT3_EXT_CACHE_NO; ++ ++ /* has cache valid data? */ ++ if (cex->ec_type == EXT3_EXT_CACHE_NO) ++ return EXT3_EXT_CACHE_NO; ++ ++ EXT_ASSERT(cex->ec_type == EXT3_EXT_CACHE_GAP || ++ cex->ec_type == EXT3_EXT_CACHE_EXTENT); ++ if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) { ++ ex->ee_block = cex->ec_block; ++ ex->ee_start = cex->ec_start; ++ ex->ee_len = cex->ec_len; ++ ext_debug(tree, "%lu cached by %lu:%lu:%lu\n", ++ (unsigned long) block, ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len, ++ (unsigned long) ex->ee_start); ++ return cex->ec_type; ++ } ++ ++ /* not in cache */ ++ return EXT3_EXT_CACHE_NO; ++} ++ ++/* ++ * routine removes index from the index block ++ * it's used in truncate case only. 
thus all requests are for ++ * the last index in the block only ++ */ ++int ext3_ext_rm_idx(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ struct buffer_head *bh; ++ int err; ++ ++ /* free index block */ ++ path--; ++ EXT_ASSERT(path->p_hdr->eh_entries); ++ if ((err = ext3_ext_get_access(handle, tree, path))) ++ return err; ++ path->p_hdr->eh_entries--; ++ if ((err = ext3_ext_dirty(handle, tree, path))) ++ return err; ++ ext_debug(tree, "index is empty, remove it, free block %d\n", ++ path->p_idx->ei_leaf); ++ bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); ++ ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); ++ ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1); ++ return err; ++} ++ ++int ext3_ext_calc_credits_for_insert(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int depth = EXT_DEPTH(tree); ++ int needed; ++ ++ if (path) { ++ /* probably there is space in the leaf? */ ++ if (path[depth].p_hdr->eh_entries < path[depth].p_hdr->eh_max) ++ return 1; ++ } ++ ++ /* ++ * the worst case we're expecting is creation of the ++ * new root (growing in depth) with index splitting; ++ * for splitting we have to consider depth + 1 because ++ * previous growing could have increased it ++ */ ++ depth = depth + 1; ++ ++ /* ++ * growing in depth: ++ * block allocation + new root + old root ++ */ ++ needed = EXT3_ALLOC_NEEDED + 2; ++ ++ /* index split.
we may need: ++ * allocate intermediate indexes and new leaf ++ * change two blocks at each level, but root ++ * modify root block (inode) ++ */ ++ needed += (depth * EXT3_ALLOC_NEEDED) + (2 * depth) + 1; ++ ++ return needed; ++} ++ ++static int ++ext3_ext_split_for_rm(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, unsigned long start, ++ unsigned long end) ++{ ++ struct ext3_extent *ex, tex; ++ struct ext3_ext_path *npath; ++ int depth, creds, err; ++ ++ depth = EXT_DEPTH(tree); ++ ex = path[depth].p_ext; ++ EXT_ASSERT(ex); ++ EXT_ASSERT(end < ex->ee_block + ex->ee_len - 1); ++ EXT_ASSERT(ex->ee_block < start); ++ ++ /* calculate tail extent */ ++ tex.ee_block = end + 1; ++ EXT_ASSERT(tex.ee_block < ex->ee_block + ex->ee_len); ++ tex.ee_len = ex->ee_block + ex->ee_len - tex.ee_block; ++ ++ creds = ext3_ext_calc_credits_for_insert(tree, path); ++ handle = ext3_ext_journal_restart(handle, creds); ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ ++ /* calculate head extent. use primary extent */ ++ err = ext3_ext_get_access(handle, tree, path + depth); ++ if (err) ++ return err; ++ ex->ee_len = start - ex->ee_block; ++ err = ext3_ext_dirty(handle, tree, path + depth); ++ if (err) ++ return err; ++ ++ /* FIXME: some callback to free underlying resource ++ * and correct ee_start? 
*/ ++ ext_debug(tree, "split extent: head %u:%u, tail %u:%u\n", ++ ex->ee_block, ex->ee_len, tex.ee_block, tex.ee_len); ++ ++ npath = ext3_ext_find_extent(tree, ex->ee_block, NULL); ++ if (IS_ERR(npath)) ++ return PTR_ERR(npath); ++ depth = EXT_DEPTH(tree); ++ EXT_ASSERT(npath[depth].p_ext->ee_block == ex->ee_block); ++ EXT_ASSERT(npath[depth].p_ext->ee_len == ex->ee_len); ++ ++ err = ext3_ext_insert_extent(handle, tree, npath, &tex); ++ ext3_ext_drop_refs(npath); ++ kfree(npath); ++ ++ return err; ++} ++ ++static int ++ext3_ext_rm_leaf(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, unsigned long start, ++ unsigned long end) ++{ ++ struct ext3_extent *ex, *fu = NULL, *lu, *le; ++ int err = 0, correct_index = 0; ++ int depth = EXT_DEPTH(tree), credits; ++ struct ext3_extent_header *eh; ++ unsigned a, b, block, num; ++ ++ ext_debug(tree, "remove [%lu:%lu] in leaf\n", start, end); ++ if (!path[depth].p_hdr) ++ path[depth].p_hdr = EXT_BLOCK_HDR(path[depth].p_bh); ++ eh = path[depth].p_hdr; ++ EXT_ASSERT(eh); ++ EXT_ASSERT(eh->eh_entries <= eh->eh_max); ++ EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC); ++ ++ /* find where to start removing */ ++ le = ex = EXT_LAST_EXTENT(eh); ++ while (ex != EXT_FIRST_EXTENT(eh)) { ++ if (ex->ee_block <= end) ++ break; ++ ex--; ++ } ++ ++ if (start > ex->ee_block && end < ex->ee_block + ex->ee_len - 1) { ++ /* removal of internal part of the extent requested ++ * tail and head must be placed in different extent ++ * so, we have to insert one more extent */ ++ path[depth].p_ext = ex; ++ return ext3_ext_split_for_rm(handle, tree, path, start, end); ++ } ++ ++ lu = ex; ++ while (ex >= EXT_FIRST_EXTENT(eh) && ex->ee_block + ex->ee_len > start) { ++ ext_debug(tree, "remove ext %u:%u\n", ex->ee_block, ex->ee_len); ++ path[depth].p_ext = ex; ++ ++ a = ex->ee_block > start ? ex->ee_block : start; ++ b = ex->ee_block + ex->ee_len - 1 < end ? 
++ ex->ee_block + ex->ee_len - 1 : end; ++ ++ ext_debug(tree, " border %u:%u\n", a, b); ++ ++ if (a != ex->ee_block && b != ex->ee_block + ex->ee_len - 1) { ++ block = 0; ++ num = 0; ++ BUG(); ++ } else if (a != ex->ee_block) { ++ /* remove tail of the extent */ ++ block = ex->ee_block; ++ num = a - block; ++ } else if (b != ex->ee_block + ex->ee_len - 1) { ++ /* remove head of the extent */ ++ block = a; ++ num = b - a; ++ } else { ++ /* remove whole extent: excellent! */ ++ block = ex->ee_block; ++ num = 0; ++ EXT_ASSERT(a == ex->ee_block && ++ b == ex->ee_block + ex->ee_len - 1); ++ } ++ ++ if (ex == EXT_FIRST_EXTENT(eh)) ++ correct_index = 1; ++ ++ credits = 1; ++ if (correct_index) ++ credits += (EXT_DEPTH(tree) * EXT3_ALLOC_NEEDED) + 1; ++ if (tree->ops->remove_extent_credits) ++ credits += tree->ops->remove_extent_credits(tree, ex, a, b); ++ ++ handle = ext3_ext_journal_restart(handle, credits); ++ if (IS_ERR(handle)) { ++ err = PTR_ERR(handle); ++ goto out; ++ } ++ ++ err = ext3_ext_get_access(handle, tree, path + depth); ++ if (err) ++ goto out; ++ ++ if (tree->ops->remove_extent) ++ err = tree->ops->remove_extent(tree, ex, a, b); ++ if (err) ++ goto out; ++ ++ if (num == 0) { ++ /* this extent is removed entirely; mark the slot unused */ ++ ex->ee_start = 0; ++ eh->eh_entries--; ++ fu = ex; ++ } ++ ++ ex->ee_block = block; ++ ex->ee_len = num; ++ ++ err = ext3_ext_dirty(handle, tree, path + depth); ++ if (err) ++ goto out; ++ ++ ext_debug(tree, "new extent: %u:%u:%u\n", ++ ex->ee_block, ex->ee_len, ex->ee_start); ++ ex--; ++ } ++ ++ if (fu) { ++ /* reuse unused slots */ ++ while (lu < le) { ++ if (lu->ee_start) { ++ *fu = *lu; ++ lu->ee_start = 0; ++ fu++; ++ } ++ lu++; ++ } ++ } ++ ++ if (correct_index && eh->eh_entries) ++ err = ext3_ext_correct_indexes(handle, tree, path); ++ ++ /* if this leaf is free, then we should ++ * remove it from the index block above */ ++ if (err == 0 && eh->eh_entries == 0 && path[depth].p_bh != NULL) ++ err = ext3_ext_rm_idx(handle, tree,
path + depth); ++ ++out: ++ return err; ++} ++ ++ ++static struct ext3_extent_idx * ++ext3_ext_last_covered(struct ext3_extent_header *hdr, unsigned long block) ++{ ++ struct ext3_extent_idx *ix; ++ ++ ix = EXT_LAST_INDEX(hdr); ++ while (ix != EXT_FIRST_INDEX(hdr)) { ++ if (ix->ei_block <= block) ++ break; ++ ix--; ++ } ++ return ix; ++} ++ ++/* ++ * returns 1 if the current index has to be freed (even if partially) ++ */ ++static int inline ++ext3_ext_more_to_rm(struct ext3_ext_path *path) ++{ ++ EXT_ASSERT(path->p_idx); ++ ++ if (path->p_idx < EXT_FIRST_INDEX(path->p_hdr)) ++ return 0; ++ ++ /* ++ * if truncate on a deeper level happened, it wasn't partial, ++ * so we have to consider the current index for truncation ++ */ ++ if (path->p_hdr->eh_entries == path->p_block) ++ return 0; ++ return 1; ++} ++ ++int ext3_ext_remove_space(struct ext3_extents_tree *tree, ++ unsigned long start, unsigned long end) ++{ ++ struct inode *inode = tree->inode; ++ struct super_block *sb = inode->i_sb; ++ int depth = EXT_DEPTH(tree); ++ struct ext3_ext_path *path; ++ handle_t *handle; ++ int i = 0, err = 0; ++ ++ ext_debug(tree, "space to be removed: %lu:%lu\n", start, end); ++ ++ /* probably the first extent we're going to free will be the last in the block */ ++ handle = ext3_journal_start(inode, depth + 1); ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ ++ ext3_ext_invalidate_cache(tree); ++ ++ /* ++ * we start scanning from the right side, freeing all the blocks ++ * after i_size and walking into the deep ++ */ ++ path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 1), GFP_KERNEL); ++ if (path == NULL) { ++ ext3_error(sb, __FUNCTION__, "Can't allocate path array"); ++ ext3_journal_stop(handle); ++ return -ENOMEM; ++ } ++ memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1)); ++ path[i].p_hdr = EXT_ROOT_HDR(tree); ++ ++ while (i >= 0 && err == 0) { ++ if (i == depth) { ++ /* this is a leaf block */ ++ err = ext3_ext_rm_leaf(handle, tree, path, start, end); ++ /* root level has p_bh == NULL,
brelse() eats this */ ++ brelse(path[i].p_bh); ++ i--; ++ continue; ++ } ++ ++ /* this is an index block */ ++ if (!path[i].p_hdr) { ++ ext_debug(tree, "initialize header\n"); ++ path[i].p_hdr = EXT_BLOCK_HDR(path[i].p_bh); ++ } ++ ++ EXT_ASSERT(path[i].p_hdr->eh_entries <= path[i].p_hdr->eh_max); ++ EXT_ASSERT(path[i].p_hdr->eh_magic == EXT3_EXT_MAGIC); ++ ++ if (!path[i].p_idx) { ++ /* this level hasn't been touched yet */ ++ path[i].p_idx = ++ ext3_ext_last_covered(path[i].p_hdr, end); ++ path[i].p_block = path[i].p_hdr->eh_entries + 1; ++ ext_debug(tree, "init index ptr: hdr 0x%p, num %d\n", ++ path[i].p_hdr, path[i].p_hdr->eh_entries); ++ } else { ++ /* we've already been here; look at the next index */ ++ path[i].p_idx--; ++ } ++ ++ ext_debug(tree, "level %d - index, first 0x%p, cur 0x%p\n", ++ i, EXT_FIRST_INDEX(path[i].p_hdr), ++ path[i].p_idx); ++ if (ext3_ext_more_to_rm(path + i)) { ++ /* go to the next level */ ++ ext_debug(tree, "move to level %d (block %d)\n", ++ i + 1, path[i].p_idx->ei_leaf); ++ memset(path + i + 1, 0, sizeof(*path)); ++ path[i+1].p_bh = sb_bread(sb, path[i].p_idx->ei_leaf); ++ if (!path[i+1].p_bh) { ++ /* should we reset i_size?
*/ ++ err = -EIO; ++ break; ++ } ++ /* record the actual number of indexes so we know whether ++ * this number changed at the next iteration */ ++ path[i].p_block = path[i].p_hdr->eh_entries; ++ i++; ++ } else { ++ /* we've finished processing this index, go up */ ++ if (path[i].p_hdr->eh_entries == 0 && i > 0) { ++ /* index is empty, remove it; ++ * handle must be already prepared by the ++ * truncatei_leaf() */ ++ err = ext3_ext_rm_idx(handle, tree, path + i); ++ } ++ /* root level has p_bh == NULL, brelse() eats this */ ++ brelse(path[i].p_bh); ++ i--; ++ ext_debug(tree, "return to level %d\n", i); ++ } ++ } ++ ++ /* TODO: flexible tree reduction should be here */ ++ if (path->p_hdr->eh_entries == 0) { ++ /* ++ * truncate to zero freed the whole tree, ++ * so we need to correct eh_depth ++ */ ++ err = ext3_ext_get_access(handle, tree, path); ++ if (err == 0) { ++ EXT_ROOT_HDR(tree)->eh_depth = 0; ++ EXT_ROOT_HDR(tree)->eh_max = ext3_ext_space_root(tree); ++ err = ext3_ext_dirty(handle, tree, path); ++ } ++ } ++ ext3_ext_tree_changed(tree); ++ ++ kfree(path); ++ ext3_journal_stop(handle); ++ ++ return err; ++} ++ ++int ext3_ext_calc_metadata_amount(struct ext3_extents_tree *tree, int blocks) ++{ ++ int lcap, icap, rcap, leafs, idxs, num; ++ ++ rcap = ext3_ext_space_root(tree); ++ if (blocks <= rcap) { ++ /* all extents fit in the root */ ++ return 0; ++ } ++ ++ rcap = ext3_ext_space_root_idx(tree); ++ lcap = ext3_ext_space_block(tree); ++ icap = ext3_ext_space_block_idx(tree); ++ ++ num = leafs = (blocks + lcap - 1) / lcap; ++ if (leafs <= rcap) { ++ /* all pointers to leaves fit in the root */ ++ return leafs; ++ } ++ ++ /* ok.
we need separate index block(s) to link all leaf blocks */ ++ idxs = (leafs + icap - 1) / icap; ++ do { ++ num += idxs; ++ idxs = (idxs + icap - 1) / icap; ++ } while (idxs > rcap); ++ ++ return num; ++} ++ ++/* ++ * called at mount time ++ */ ++void ext3_ext_init(struct super_block *sb) ++{ ++ /* ++ * possible initialization would be here ++ */ ++ ++ if (test_opt(sb, EXTENTS)) { ++ printk("EXT3-fs: file extents enabled"); ++#ifdef AGRESSIVE_TEST ++ printk(", aggressive tests"); ++#endif ++#ifdef CHECK_BINSEARCH ++ printk(", check binsearch"); ++#endif ++ printk("\n"); ++ } ++} ++ ++/* ++ * called at umount time ++ */ ++void ext3_ext_release(struct super_block *sb) ++{ ++} ++ ++/************************************************************************ ++ * VFS related routines ++ ************************************************************************/ ++ ++static int ext3_get_inode_write_access(handle_t *handle, void *buffer) ++{ ++ /* we use in-core data, not bh */ ++ return 0; ++} ++ ++static int ext3_mark_buffer_dirty(handle_t *handle, void *buffer) ++{ ++ struct inode *inode = buffer; ++ return ext3_mark_inode_dirty(handle, inode); ++} ++ ++static int ext3_ext_mergable(struct ext3_extent *ex1, ++ struct ext3_extent *ex2) ++{ ++ /* FIXME: support for large fs */ ++ if (ex1->ee_start + ex1->ee_len == ex2->ee_start) ++ return 1; ++ return 0; ++} ++ ++static int ++ext3_remove_blocks_credits(struct ext3_extents_tree *tree, ++ struct ext3_extent *ex, ++ unsigned long from, unsigned long to) ++{ ++ int needed; ++ ++ /* at present, an extent can't cross a block group */ ++ needed = 4; /* bitmap + group desc + sb + inode */ ++ ++#ifdef CONFIG_QUOTA ++ needed += 2 * EXT3_SINGLEDATA_TRANS_BLOCKS; ++#endif ++ return needed; ++} ++ ++static int ++ext3_remove_blocks(struct ext3_extents_tree *tree, ++ struct ext3_extent *ex, ++ unsigned long from, unsigned long to) ++{ ++ int needed = ext3_remove_blocks_credits(tree, ex, from, to); ++ handle_t *handle =
ext3_journal_start(tree->inode, needed); ++ struct buffer_head *bh; ++ int i; ++ ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { ++ /* tail removal */ ++ unsigned long num, start; ++ num = ex->ee_block + ex->ee_len - from; ++ start = ex->ee_start + ex->ee_len - num; ++ ext_debug(tree, "free last %lu blocks starting %lu\n", ++ num, start); ++ for (i = 0; i < num; i++) { ++ bh = sb_find_get_block(tree->inode->i_sb, start + i); ++ ext3_forget(handle, 0, tree->inode, bh, start + i); ++ } ++ ext3_free_blocks(handle, tree->inode, start, num); ++ } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { ++ printk("strange request: removal %lu-%lu from %u:%u\n", ++ from, to, ex->ee_block, ex->ee_len); ++ } else { ++ printk("strange request: removal(2) %lu-%lu from %u:%u\n", ++ from, to, ex->ee_block, ex->ee_len); ++ } ++ ext3_journal_stop(handle); ++ return 0; ++} ++ ++static int ext3_ext_find_goal(struct inode *inode, ++ struct ext3_ext_path *path, unsigned long block) ++{ ++ struct ext3_inode_info *ei = EXT3_I(inode); ++ unsigned long bg_start; ++ unsigned long colour; ++ int depth; ++ ++ if (path) { ++ struct ext3_extent *ex; ++ depth = path->p_depth; ++ ++ /* try to predict block placement */ ++ if ((ex = path[depth].p_ext)) ++ return ex->ee_start + (block - ex->ee_block); ++ ++ /* it looks like the index is empty; ++ * try to find a goal starting from the index itself */ ++ if (path[depth].p_bh) ++ return path[depth].p_bh->b_blocknr; ++ } ++ ++ /* OK.
use the inode's group */ ++ bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) + ++ le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block); ++ colour = (current->pid % 16) * ++ (EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16); ++ return bg_start + colour + block; ++} ++ ++static int ext3_new_block_cb(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *ex, int *err) ++{ ++ struct inode *inode = tree->inode; ++ int newblock, goal; ++ ++ EXT_ASSERT(path); ++ EXT_ASSERT(ex); ++ EXT_ASSERT(ex->ee_start); ++ EXT_ASSERT(ex->ee_len); ++ ++ /* reuse block from the extent to order data/metadata */ ++ newblock = ex->ee_start++; ++ ex->ee_len--; ++ if (ex->ee_len == 0) { ++ ex->ee_len = 1; ++ /* allocate new block for the extent */ ++ goal = ext3_ext_find_goal(inode, path, ex->ee_block); ++ ex->ee_start = ext3_new_block(handle, inode, goal, err); ++ if (ex->ee_start == 0) { ++ /* an error occurred: restore the old extent */ ++ ex->ee_start = newblock; ++ return 0; ++ } ++ } ++ return newblock; ++} ++ ++static struct ext3_extents_helpers ext3_blockmap_helpers = { ++ .get_write_access = ext3_get_inode_write_access, ++ .mark_buffer_dirty = ext3_mark_buffer_dirty, ++ .mergable = ext3_ext_mergable, ++ .new_block = ext3_new_block_cb, ++ .remove_extent = ext3_remove_blocks, ++ .remove_extent_credits = ext3_remove_blocks_credits, ++}; ++ ++void ext3_init_tree_desc(struct ext3_extents_tree *tree, ++ struct inode *inode) ++{ ++ tree->inode = inode; ++ tree->root = (void *) EXT3_I(inode)->i_data; ++ tree->buffer = (void *) inode; ++ tree->buffer_len = sizeof(EXT3_I(inode)->i_data); ++ tree->cex = (struct ext3_ext_cache *) &EXT3_I(inode)->i_cached_extent; ++ tree->ops = &ext3_blockmap_helpers; ++} ++ ++int ext3_ext_get_block(handle_t *handle, struct inode *inode, ++ long iblock, struct buffer_head *bh_result, ++ int create, int extend_disksize) ++{ ++ struct ext3_ext_path *path = NULL; ++ struct ext3_extent newex; ++ struct
ext3_extent *ex; ++ int goal, newblock, err = 0, depth; ++ struct ext3_extents_tree tree; ++ ++ clear_buffer_new(bh_result); ++ ext3_init_tree_desc(&tree, inode); ++ ext_debug(&tree, "block %d requested for inode %u\n", ++ (int) iblock, (unsigned) inode->i_ino); ++ down(&EXT3_I(inode)->truncate_sem); ++ ++ /* check in cache */ ++ if ((goal = ext3_ext_in_cache(&tree, iblock, &newex))) { ++ if (goal == EXT3_EXT_CACHE_GAP) { ++ if (!create) { ++ /* the block isn't allocated yet and ++ * the user doesn't want to allocate it */ ++ goto out2; ++ } ++ /* we should allocate the requested block */ ++ } else if (goal == EXT3_EXT_CACHE_EXTENT) { ++ /* the block is already allocated */ ++ newblock = iblock - newex.ee_block + newex.ee_start; ++ goto out; ++ } else { ++ EXT_ASSERT(0); ++ } ++ } ++ ++ /* find extent for this block */ ++ path = ext3_ext_find_extent(&tree, iblock, NULL); ++ if (IS_ERR(path)) { ++ err = PTR_ERR(path); ++ path = NULL; ++ goto out2; ++ } ++ ++ depth = EXT_DEPTH(&tree); ++ ++ /* ++ * a consistent leaf must not be empty; ++ * this situation is possible, though, _during_ tree modification; ++ * this is why the assert can't be put in ext3_ext_find_extent() ++ */ ++ EXT_ASSERT(path[depth].p_ext != NULL || depth == 0); ++ ++ if ((ex = path[depth].p_ext)) { ++ /* if the found extent covers the block, simply return it */ ++ if (iblock >= ex->ee_block && iblock < ex->ee_block + ex->ee_len) { ++ newblock = iblock - ex->ee_block + ex->ee_start; ++ ext_debug(&tree, "%d fit into %d:%d -> %d\n", ++ (int) iblock, ex->ee_block, ex->ee_len, ++ newblock); ++ ext3_ext_put_in_cache(&tree, ex->ee_block, ++ ex->ee_len, ex->ee_start, ++ EXT3_EXT_CACHE_EXTENT); ++ goto out; ++ } ++ } ++ ++ /* ++ * the requested block isn't allocated yet; ++ * we shouldn't create blocks if the create flag is zero ++ */ ++ if (!create) { ++ /* put the just-found gap into the cache to speed up subsequent requests */ ++ ext3_ext_put_gap_in_cache(&tree, path, iblock); ++ goto out2; ++ } ++ ++ /* allocate new block */ ++ goal =
ext3_ext_find_goal(inode, path, iblock); ++ newblock = ext3_new_block(handle, inode, goal, &err); ++ if (!newblock) ++ goto out2; ++ ext_debug(&tree, "allocate new block: goal %d, found %d\n", ++ goal, newblock); ++ ++ /* try to insert new extent into found leaf and return */ ++ newex.ee_block = iblock; ++ newex.ee_start = newblock; ++ newex.ee_len = 1; ++ err = ext3_ext_insert_extent(handle, &tree, path, &newex); ++ if (err) ++ goto out2; ++ ++ if (extend_disksize && inode->i_size > EXT3_I(inode)->i_disksize) ++ EXT3_I(inode)->i_disksize = inode->i_size; ++ ++ /* previous routine could use block we allocated */ ++ newblock = newex.ee_start; ++ set_buffer_new(bh_result); ++ ++ ext3_ext_put_in_cache(&tree, newex.ee_block, newex.ee_len, ++ newex.ee_start, EXT3_EXT_CACHE_EXTENT); ++out: ++ ext3_ext_show_leaf(&tree, path); ++ map_bh(bh_result, inode->i_sb, newblock); ++out2: ++ if (path) { ++ ext3_ext_drop_refs(path); ++ kfree(path); ++ } ++ up(&EXT3_I(inode)->truncate_sem); ++ ++ return err; ++} ++ ++void ext3_ext_truncate(struct inode * inode, struct page *page) ++{ ++ struct address_space *mapping = inode->i_mapping; ++ struct super_block *sb = inode->i_sb; ++ struct ext3_extents_tree tree; ++ unsigned long last_block; ++ handle_t *handle; ++ int err = 0; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ++ /* ++ * probably the first extent we're going to free will be the last in the block ++ */ ++ err = ext3_writepage_trans_blocks(inode) + 3; ++ handle = ext3_journal_start(inode, err); ++ if (IS_ERR(handle)) { ++ if (page) { ++ clear_highpage(page); ++ flush_dcache_page(page); ++ unlock_page(page); ++ page_cache_release(page); ++ } ++ return; ++ } ++ ++ if (page) ++ ext3_block_truncate_page(handle, page, mapping, inode->i_size); ++ ++ down(&EXT3_I(inode)->truncate_sem); ++ ext3_ext_invalidate_cache(&tree); ++ ++ /* ++ * TODO: optimization is possible here; ++ * probably we don't need scanning at all, ++ * because page truncation is enough ++ */ ++ if (ext3_orphan_add(handle, inode)) ++
goto out_stop; ++ ++ /* we have to know where to truncate from in crash case */ ++ EXT3_I(inode)->i_disksize = inode->i_size; ++ ext3_mark_inode_dirty(handle, inode); ++ ++ last_block = (inode->i_size + sb->s_blocksize - 1) >> ++ EXT3_BLOCK_SIZE_BITS(sb); ++ err = ext3_ext_remove_space(&tree, last_block, EXT_MAX_BLOCK); ++ ++ /* In a multi-transaction truncate, we only make the final ++ * transaction synchronous */ ++ if (IS_SYNC(inode)) ++ handle->h_sync = 1; ++ ++out_stop: ++ /* ++ * If this was a simple ftruncate(), and the file will remain alive ++ * then we need to clear up the orphan record which we created above. ++ * However, if this was a real unlink then we were called by ++ * ext3_delete_inode(), and we allow that function to clean up the ++ * orphan info for us. ++ */ ++ if (inode->i_nlink) ++ ext3_orphan_del(handle, inode); ++ ++ up(&EXT3_I(inode)->truncate_sem); ++ ext3_journal_stop(handle); ++} ++ ++/* ++ * this routine calculate max number of blocks we could modify ++ * in order to allocate new block for an inode ++ */ ++int ext3_ext_writepage_trans_blocks(struct inode *inode, int num) ++{ ++ struct ext3_extents_tree tree; ++ int needed; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ++ needed = ext3_ext_calc_credits_for_insert(&tree, NULL); ++ ++ /* caller want to allocate num blocks */ ++ needed *= num; ++ ++#ifdef CONFIG_QUOTA ++ /* ++ * FIXME: real calculation should be here ++ * it depends on blockmap format of qouta file ++ */ ++ needed += 2 * EXT3_SINGLEDATA_TRANS_BLOCKS; ++#endif ++ ++ return needed; ++} ++ ++void ext3_extents_initialize_blockmap(handle_t *handle, struct inode *inode) ++{ ++ struct ext3_extents_tree tree; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ext3_extent_tree_init(handle, &tree); ++} ++ ++int ext3_ext_calc_blockmap_metadata(struct inode *inode, int blocks) ++{ ++ struct ext3_extents_tree tree; ++ ++ ext3_init_tree_desc(&tree, inode); ++ return ext3_ext_calc_metadata_amount(&tree, blocks); ++} ++ ++static int 
++ext3_ext_store_extent_cb(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_ext_cache *newex) ++{ ++ struct ext3_extent_buf *buf = (struct ext3_extent_buf *) tree->private; ++ ++ if (newex->ec_type != EXT3_EXT_CACHE_EXTENT) ++ return EXT_CONTINUE; ++ ++ if (buf->err < 0) ++ return EXT_BREAK; ++ if (buf->cur - buf->buffer + sizeof(*newex) > buf->buflen) ++ return EXT_BREAK; ++ ++ if (!copy_to_user(buf->cur, newex, sizeof(*newex))) { ++ buf->err++; ++ buf->cur += sizeof(*newex); ++ } else { ++ buf->err = -EFAULT; ++ return EXT_BREAK; ++ } ++ return EXT_CONTINUE; ++} ++ ++static int ++ext3_ext_collect_stats_cb(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_ext_cache *ex) ++{ ++ struct ext3_extent_tree_stats *buf = ++ (struct ext3_extent_tree_stats *) tree->private; ++ int depth; ++ ++ if (ex->ec_type != EXT3_EXT_CACHE_EXTENT) ++ return EXT_CONTINUE; ++ ++ depth = EXT_DEPTH(tree); ++ buf->extents_num++; ++ if (path[depth].p_ext == EXT_FIRST_EXTENT(path[depth].p_hdr)) ++ buf->leaf_num++; ++ return EXT_CONTINUE; ++} ++ ++int ext3_ext_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, ++ unsigned long arg) ++{ ++ int err = 0; ++ ++ if (!(EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)) ++ return -EINVAL; ++ ++ if (cmd == EXT3_IOC_GET_EXTENTS) { ++ struct ext3_extent_buf buf; ++ struct ext3_extents_tree tree; ++ ++ if (copy_from_user(&buf, (void *) arg, sizeof(buf))) ++ return -EFAULT; ++ ++ ext3_init_tree_desc(&tree, inode); ++ buf.cur = buf.buffer; ++ buf.err = 0; ++ tree.private = &buf; ++ down(&EXT3_I(inode)->truncate_sem); ++ err = ext3_ext_walk_space(&tree, buf.start, EXT_MAX_BLOCK, ++ ext3_ext_store_extent_cb); ++ up(&EXT3_I(inode)->truncate_sem); ++ if (err == 0) ++ err = buf.err; ++ } else if (cmd == EXT3_IOC_GET_TREE_STATS) { ++ struct ext3_extent_tree_stats buf; ++ struct ext3_extents_tree tree; ++ ++ ext3_init_tree_desc(&tree, inode); ++ down(&EXT3_I(inode)->truncate_sem); ++ buf.depth = 
EXT_DEPTH(&tree); ++ buf.extents_num = 0; ++ buf.leaf_num = 0; ++ tree.private = &buf; ++ err = ext3_ext_walk_space(&tree, 0, EXT_MAX_BLOCK, ++ ext3_ext_collect_stats_cb); ++ up(&EXT3_I(inode)->truncate_sem); ++ if (!err) ++ err = copy_to_user((void *) arg, &buf, sizeof(buf)); ++ } else if (cmd == EXT3_IOC_GET_TREE_DEPTH) { ++ struct ext3_extents_tree tree; ++ ext3_init_tree_desc(&tree, inode); ++ down(&EXT3_I(inode)->truncate_sem); ++ err = EXT_DEPTH(&tree); ++ up(&EXT3_I(inode)->truncate_sem); ++ } ++ ++ return err; ++} ++ ++EXPORT_SYMBOL(ext3_init_tree_desc); ++EXPORT_SYMBOL(ext3_mark_inode_dirty); ++EXPORT_SYMBOL(ext3_ext_invalidate_cache); ++EXPORT_SYMBOL(ext3_ext_insert_extent); ++EXPORT_SYMBOL(ext3_ext_walk_space); ++EXPORT_SYMBOL(ext3_ext_find_goal); ++EXPORT_SYMBOL(ext3_ext_calc_credits_for_insert); +Index: linux-2.6.16.21-0.8/fs/ext3/ialloc.c +=================================================================== +--- linux-2.6.16.21-0.8.orig/fs/ext3/ialloc.c ++++ linux-2.6.16.21-0.8/fs/ext3/ialloc.c +@@ -598,7 +598,7 @@ got: + ei->i_dir_start_lookup = 0; + ei->i_disksize = 0; + +- ei->i_flags = EXT3_I(dir)->i_flags & ~EXT3_INDEX_FL; ++ ei->i_flags = EXT3_I(dir)->i_flags & ~(EXT3_INDEX_FL|EXT3_EXTENTS_FL); + if (S_ISLNK(mode)) + ei->i_flags &= ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL); + /* dirsync only applies to directories */ +@@ -642,6 +642,18 @@ got: + if (err) + goto fail_free_drop; + ++ if (test_opt(sb, EXTENTS) && S_ISREG(inode->i_mode)) { ++ EXT3_I(inode)->i_flags |= EXT3_EXTENTS_FL; ++ ext3_extents_initialize_blockmap(handle, inode); ++ if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS)) { ++ err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh); ++ if (err) goto fail; ++ EXT3_SET_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS); ++ BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "call ext3_journal_dirty_metadata"); ++ err = ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); ++ } ++ } ++ + err = ext3_mark_inode_dirty(handle, inode); + 
if (err) { + ext3_std_error(sb, err); +Index: linux-2.6.16.21-0.8/fs/ext3/inode.c +=================================================================== +--- linux-2.6.16.21-0.8.orig/fs/ext3/inode.c ++++ linux-2.6.16.21-0.8/fs/ext3/inode.c +@@ -40,7 +40,7 @@ + #include "iopen.h" + #include "acl.h" + +-static int ext3_writepage_trans_blocks(struct inode *inode); ++int ext3_writepage_trans_blocks(struct inode *inode); + + /* + * Test whether an inode is a fast symlink. +@@ -788,6 +788,17 @@ out: + return err; + } + ++static inline int ++ext3_get_block_wrap(handle_t *handle, struct inode *inode, long block, ++ struct buffer_head *bh, int create, int extend_disksize) ++{ ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_get_block(handle, inode, block, bh, create, ++ extend_disksize); ++ return ext3_get_block_handle(handle, inode, block, bh, create, ++ extend_disksize); ++} ++ + static int ext3_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) + { +@@ -798,8 +809,8 @@ static int ext3_get_block(struct inode * + handle = ext3_journal_current_handle(); + J_ASSERT(handle != 0); + } +- ret = ext3_get_block_handle(handle, inode, iblock, +- bh_result, create, 1); ++ ret = ext3_get_block_wrap(handle, inode, iblock, ++ bh_result, create, 1); + return ret; + } + +@@ -843,7 +854,7 @@ ext3_direct_io_get_blocks(struct inode * + + get_block: + if (ret == 0) +- ret = ext3_get_block_handle(handle, inode, iblock, ++ ret = ext3_get_block_wrap(handle, inode, iblock, + bh_result, create, 0); + bh_result->b_size = (1 << inode->i_blkbits); + return ret; +@@ -863,7 +874,7 @@ struct buffer_head *ext3_getblk(handle_t + dummy.b_state = 0; + dummy.b_blocknr = -1000; + buffer_trace_init(&dummy.b_history); +- *errp = ext3_get_block_handle(handle, inode, block, &dummy, create, 1); ++ *errp = ext3_get_block_wrap(handle, inode, block, &dummy, create, 1); + if (!*errp && buffer_mapped(&dummy)) { + struct buffer_head *bh; + bh = 
sb_getblk(inode->i_sb, dummy.b_blocknr); +@@ -1606,7 +1617,7 @@ void ext3_set_aops(struct inode *inode) + * This required during truncate. We need to physically zero the tail end + * of that block so it doesn't yield old data if the file is later grown. + */ +-static int ext3_block_truncate_page(handle_t *handle, struct page *page, ++int ext3_block_truncate_page(handle_t *handle, struct page *page, + struct address_space *mapping, loff_t from) + { + unsigned long index = from >> PAGE_CACHE_SHIFT; +@@ -2116,6 +2127,9 @@ void ext3_truncate(struct inode * inode) + return; + } + ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_truncate(inode, page); ++ + handle = start_transaction(inode); + if (IS_ERR(handle)) { + if (page) { +@@ -2863,12 +2877,15 @@ err_out: + * block and work out the exact number of indirects which are touched. Pah. + */ + +-static int ext3_writepage_trans_blocks(struct inode *inode) ++int ext3_writepage_trans_blocks(struct inode *inode) + { + int bpp = ext3_journal_blocks_per_page(inode); + int indirects = (EXT3_NDIR_BLOCKS % bpp) ? 
5 : 3; + int ret; + ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_writepage_trans_blocks(inode, bpp); ++ + if (ext3_should_journal_data(inode)) + ret = 3 * (bpp + indirects) + 2; + else +Index: linux-2.6.16.21-0.8/fs/ext3/Makefile +=================================================================== +--- linux-2.6.16.21-0.8.orig/fs/ext3/Makefile ++++ linux-2.6.16.21-0.8/fs/ext3/Makefile +@@ -5,7 +5,8 @@ + obj-$(CONFIG_EXT3_FS) += ext3.o + + ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ +- ioctl.o namei.o super.o symlink.o hash.o resize.o ++ ioctl.o namei.o super.o symlink.o hash.o resize.o \ ++ extents.o + + ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o + ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o +Index: linux-2.6.16.21-0.8/fs/ext3/super.c +=================================================================== +--- linux-2.6.16.21-0.8.orig/fs/ext3/super.c ++++ linux-2.6.16.21-0.8/fs/ext3/super.c +@@ -392,6 +392,7 @@ static void ext3_put_super (struct super + struct ext3_super_block *es = sbi->s_es; + int i; + ++ ext3_ext_release(sb); + ext3_xattr_put_super(sb); + journal_destroy(sbi->s_journal); + if (!(sb->s_flags & MS_RDONLY)) { +@@ -456,6 +457,8 @@ static struct inode *ext3_alloc_inode(st + #endif + ei->i_block_alloc_info = NULL; + ei->vfs_inode.i_version = 1; ++ ++ memset(&ei->i_cached_extent, 0, sizeof(ei->i_cached_extent)); + return &ei->vfs_inode; + } + +@@ -638,6 +641,7 @@ enum { + Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, + Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, + Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, ++ Opt_extents, Opt_extdebug, + Opt_grpquota + }; + +@@ -689,6 +693,8 @@ static match_table_t tokens = { + {Opt_iopen, "iopen"}, + {Opt_noiopen, "noiopen"}, + {Opt_iopen_nopriv, "iopen_nopriv"}, ++ {Opt_extents, "extents"}, ++ {Opt_extdebug, "extdebug"}, + {Opt_barrier, "barrier=%u"}, + {Opt_err, NULL}, + {Opt_resize, "resize"}, +@@ -1030,6 
+1036,12 @@ clear_qf_name: + case Opt_nobh: + set_opt(sbi->s_mount_opt, NOBH); + break; ++ case Opt_extents: ++ set_opt (sbi->s_mount_opt, EXTENTS); ++ break; ++ case Opt_extdebug: ++ set_opt (sbi->s_mount_opt, EXTDEBUG); ++ break; + default: + printk (KERN_ERR + "EXT3-fs: Unrecognized mount option \"%s\" " +@@ -1756,6 +1768,7 @@ static int ext3_fill_super (struct super + percpu_counter_mod(&sbi->s_dirs_counter, + ext3_count_dirs(sb)); + ++ ext3_ext_init(sb); + lock_kernel(); + return 0; + +Index: linux-2.6.16.21-0.8/fs/ext3/ioctl.c +=================================================================== +--- linux-2.6.16.21-0.8.orig/fs/ext3/ioctl.c ++++ linux-2.6.16.21-0.8/fs/ext3/ioctl.c +@@ -125,6 +125,10 @@ flags_err: + err = ext3_change_inode_journal_flag(inode, jflag); + return err; + } ++ case EXT3_IOC_GET_EXTENTS: ++ case EXT3_IOC_GET_TREE_STATS: ++ case EXT3_IOC_GET_TREE_DEPTH: ++ return ext3_ext_ioctl(inode, filp, cmd, arg); + case EXT3_IOC_GETVERSION: + case EXT3_IOC_GETVERSION_OLD: + return put_user(inode->i_generation, (int __user *) arg); +Index: linux-2.6.16.21-0.8/include/linux/ext3_fs.h +=================================================================== +--- linux-2.6.16.21-0.8.orig/include/linux/ext3_fs.h ++++ linux-2.6.16.21-0.8/include/linux/ext3_fs.h +@@ -185,9 +185,10 @@ struct ext3_group_desc + #define EXT3_NOTAIL_FL 0x00008000 /* file tail should not be merged */ + #define EXT3_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */ + #define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/ ++#define EXT3_EXTENTS_FL 0x00080000 /* Inode uses extents */ + #define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */ + +-#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */ ++#define EXT3_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */ + #define EXT3_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */ + + /* +@@ -237,6 +238,9 @@ struct ext3_new_group_data { + #endif + #define EXT3_IOC_GETRSVSZ _IOR('f', 
5, long) + #define EXT3_IOC_SETRSVSZ _IOW('f', 6, long) ++#define EXT3_IOC_GET_EXTENTS _IOR('f', 7, long) ++#define EXT3_IOC_GET_TREE_DEPTH _IOR('f', 8, long) ++#define EXT3_IOC_GET_TREE_STATS _IOR('f', 9, long) + + /* + * Mount options +@@ -377,6 +381,8 @@ struct ext3_inode { + #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */ + #define EXT3_MOUNT_IOPEN 0x400000 /* Allow access via iopen */ + #define EXT3_MOUNT_IOPEN_NOPRIV 0x800000/* Make iopen world-readable */ ++#define EXT3_MOUNT_EXTENTS 0x1000000/* Extents support */ ++#define EXT3_MOUNT_EXTDEBUG 0x2000000/* Extents debug */ + + /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ + #ifndef clear_opt +@@ -565,11 +571,13 @@ static inline struct ext3_inode_info *EX + #define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004 /* Needs recovery */ + #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */ + #define EXT3_FEATURE_INCOMPAT_META_BG 0x0010 ++#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */ + + #define EXT3_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR + #define EXT3_FEATURE_INCOMPAT_SUPP (EXT3_FEATURE_INCOMPAT_FILETYPE| \ + EXT3_FEATURE_INCOMPAT_RECOVER| \ +- EXT3_FEATURE_INCOMPAT_META_BG) ++ EXT3_FEATURE_INCOMPAT_META_BG| \ ++ EXT3_FEATURE_INCOMPAT_EXTENTS) + #define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \ + EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \ + EXT3_FEATURE_RO_COMPAT_BTREE_DIR) +@@ -776,6 +784,7 @@ extern unsigned long ext3_count_free (st + + + /* inode.c */ ++extern int ext3_block_truncate_page(handle_t *, struct page *, struct address_space *, loff_t); + extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int); + extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); + extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); +@@ -792,6 +801,7 @@ extern int ext3_get_inode_loc(struct ino + extern void ext3_truncate (struct inode 
*); + extern void ext3_set_inode_flags(struct inode *); + extern void ext3_set_aops(struct inode *inode); ++extern int ext3_writepage_trans_blocks(struct inode *inode); + + /* ioctl.c */ + extern int ext3_ioctl (struct inode *, struct file *, unsigned int, +@@ -845,6 +855,16 @@ extern struct inode_operations ext3_spec + extern struct inode_operations ext3_symlink_inode_operations; + extern struct inode_operations ext3_fast_symlink_inode_operations; + ++/* extents.c */ ++extern int ext3_ext_writepage_trans_blocks(struct inode *, int); ++extern int ext3_ext_get_block(handle_t *, struct inode *, long, ++ struct buffer_head *, int, int); ++extern void ext3_ext_truncate(struct inode *, struct page *); ++extern void ext3_ext_init(struct super_block *); ++extern void ext3_ext_release(struct super_block *); ++extern void ext3_extents_initialize_blockmap(handle_t *, struct inode *); ++extern int ext3_ext_ioctl(struct inode *inode, struct file *filp, ++ unsigned int cmd, unsigned long arg); + + #endif /* __KERNEL__ */ + +Index: linux-2.6.16.21-0.8/include/linux/ext3_extents.h +=================================================================== +--- /dev/null ++++ linux-2.6.16.21-0.8/include/linux/ext3_extents.h +@@ -0,0 +1,264 @@ ++/* ++ * Copyright (c) 2003, Cluster File Systems, Inc, info@clusterfs.com ++ * Written by Alex Tomas ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License version 2 as ++ * published by the Free Software Foundation. ++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. 
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program; if not, write to the Free Software
++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
++ */
++
++#ifndef _LINUX_EXT3_EXTENTS
++#define _LINUX_EXT3_EXTENTS
++
++/*
++ * with AGRESSIVE_TEST defined, the capacity of index/leaf blocks
++ * becomes very small, so index splits, in-depth growing and
++ * other hard changes happen much more often;
++ * this is for debug purposes only
++ */
++#define AGRESSIVE_TEST_
++
++/*
++ * if CHECK_BINSEARCH is defined, then the results of the binary search
++ * will be checked by a linear search
++ */
++#define CHECK_BINSEARCH_
++
++/*
++ * if EXT_DEBUG is defined you can use the 'extdebug' mount option
++ * to get lots of info about what's going on
++ */
++#define EXT_DEBUG_
++#ifdef EXT_DEBUG
++#define ext_debug(tree,fmt,a...) \
++do { \
++    if (test_opt((tree)->inode->i_sb, EXTDEBUG)) \
++        printk(fmt, ##a); \
++} while (0);
++#else
++#define ext_debug(tree,fmt,a...)
++#endif
++
++/*
++ * if EXT_STATS is defined then stats numbers are collected;
++ * these numbers will be displayed at umount time
++ */
++#define EXT_STATS_
++
++
++#define EXT3_ALLOC_NEEDED 3 /* block bitmap + group desc. + sb */
++
++/*
++ * ext3_inode has an i_block array (60 bytes total);
++ * the first 4 bytes are used to store:
++ * - tree depth (0 means there is no tree yet; all extents in the inode)
++ * - number of alive extents in the inode
++ */
++
++/*
++ * this is the extent on-disk structure;
++ * it's used at the bottom of the tree
++ */
++struct ext3_extent {
++    __u32 ee_block;    /* first logical block the extent covers */
++    __u16 ee_len;      /* number of blocks covered by the extent */
++    __u16 ee_start_hi; /* high 16 bits of physical block */
++    __u32 ee_start;    /* low 32 bits of physical block */
++};
++
++/*
++ * this is the index on-disk structure;
++ * it's used at all levels but the bottom
++ */
++struct ext3_extent_idx {
++    __u32 ei_block;   /* index covers logical blocks from 'block' */
++    __u32 ei_leaf;    /* pointer to the physical block of the next *
++                       * level; a leaf or the next index could be here */
++    __u16 ei_leaf_hi; /* high 16 bits of physical block */
++    __u16 ei_unused;
++};
++
++/*
++ * each block (leaves and indexes), even inode-stored, has a header
++ */
++struct ext3_extent_header {
++    __u16 eh_magic;   /* probably will support different formats */
++    __u16 eh_entries; /* number of valid entries */
++    __u16 eh_max;     /* capacity of store in entries */
++    __u16 eh_depth;   /* does the tree have real underlying blocks?
*/ ++ __u32 eh_generation; /* generation of the tree */ ++}; ++ ++#define EXT3_EXT_MAGIC 0xf30a ++ ++/* ++ * array of ext3_ext_path contains path to some extent ++ * creation/lookup routines use it for traversal/splitting/etc ++ * truncate uses it to simulate recursive walking ++ */ ++struct ext3_ext_path { ++ __u32 p_block; ++ __u16 p_depth; ++ struct ext3_extent *p_ext; ++ struct ext3_extent_idx *p_idx; ++ struct ext3_extent_header *p_hdr; ++ struct buffer_head *p_bh; ++}; ++ ++/* ++ * structure for external API ++ */ ++ ++/* ++ * storage for cached extent ++ */ ++struct ext3_ext_cache { ++ __u32 ec_start; ++ __u32 ec_block; ++ __u32 ec_len; ++ __u32 ec_type; ++}; ++ ++#define EXT3_EXT_CACHE_NO 0 ++#define EXT3_EXT_CACHE_GAP 1 ++#define EXT3_EXT_CACHE_EXTENT 2 ++ ++/* ++ * ext3_extents_tree is used to pass initial information ++ * to top-level extents API ++ */ ++struct ext3_extents_helpers; ++struct ext3_extents_tree { ++ struct inode *inode; /* inode which tree belongs to */ ++ void *root; /* ptr to data top of tree resides at */ ++ void *buffer; /* will be passed as arg to ^^ routines */ ++ int buffer_len; ++ void *private; ++ struct ext3_ext_cache *cex;/* last found extent */ ++ struct ext3_extents_helpers *ops; ++}; ++ ++struct ext3_extents_helpers { ++ int (*get_write_access)(handle_t *h, void *buffer); ++ int (*mark_buffer_dirty)(handle_t *h, void *buffer); ++ int (*mergable)(struct ext3_extent *ex1, struct ext3_extent *ex2); ++ int (*remove_extent_credits)(struct ext3_extents_tree *, ++ struct ext3_extent *, unsigned long, ++ unsigned long); ++ int (*remove_extent)(struct ext3_extents_tree *, ++ struct ext3_extent *, unsigned long, ++ unsigned long); ++ int (*new_block)(handle_t *, struct ext3_extents_tree *, ++ struct ext3_ext_path *, struct ext3_extent *, ++ int *); ++}; ++ ++/* ++ * to be called by ext3_ext_walk_space() ++ * negative retcode - error ++ * positive retcode - signal for ext3_ext_walk_space(), see below ++ * callback must return valid 
extent (passed or newly created) ++ */ ++typedef int (*ext_prepare_callback)(struct ext3_extents_tree *, ++ struct ext3_ext_path *, ++ struct ext3_ext_cache *); ++ ++#define EXT_CONTINUE 0 ++#define EXT_BREAK 1 ++#define EXT_REPEAT 2 ++ ++ ++#define EXT_MAX_BLOCK 0xffffffff ++ ++ ++#define EXT_FIRST_EXTENT(__hdr__) \ ++ ((struct ext3_extent *) (((char *) (__hdr__)) + \ ++ sizeof(struct ext3_extent_header))) ++#define EXT_FIRST_INDEX(__hdr__) \ ++ ((struct ext3_extent_idx *) (((char *) (__hdr__)) + \ ++ sizeof(struct ext3_extent_header))) ++#define EXT_HAS_FREE_INDEX(__path__) \ ++ ((__path__)->p_hdr->eh_entries < (__path__)->p_hdr->eh_max) ++#define EXT_LAST_EXTENT(__hdr__) \ ++ (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_entries - 1) ++#define EXT_LAST_INDEX(__hdr__) \ ++ (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_entries - 1) ++#define EXT_MAX_EXTENT(__hdr__) \ ++ (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_max - 1) ++#define EXT_MAX_INDEX(__hdr__) \ ++ (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_max - 1) ++ ++#define EXT_ROOT_HDR(tree) \ ++ ((struct ext3_extent_header *) (tree)->root) ++#define EXT_BLOCK_HDR(bh) \ ++ ((struct ext3_extent_header *) (bh)->b_data) ++#define EXT_DEPTH(_t_) \ ++ (((struct ext3_extent_header *)((_t_)->root))->eh_depth) ++#define EXT_GENERATION(_t_) \ ++ (((struct ext3_extent_header *)((_t_)->root))->eh_generation) ++ ++ ++#define EXT_ASSERT(__x__) if (!(__x__)) BUG(); ++ ++#define EXT_CHECK_PATH(tree,path) \ ++{ \ ++ int depth = EXT_DEPTH(tree); \ ++ BUG_ON((unsigned long) (path) < __PAGE_OFFSET); \ ++ BUG_ON((unsigned long) (path)[depth].p_idx < \ ++ __PAGE_OFFSET && (path)[depth].p_idx != NULL); \ ++ BUG_ON((unsigned long) (path)[depth].p_ext < \ ++ __PAGE_OFFSET && (path)[depth].p_ext != NULL); \ ++ BUG_ON((unsigned long) (path)[depth].p_hdr < __PAGE_OFFSET); \ ++ BUG_ON((unsigned long) (path)[depth].p_bh < __PAGE_OFFSET \ ++ && depth != 0); \ ++ BUG_ON((path)[0].p_depth != depth); \ ++} ++ ++ ++/* ++ * this structure is used to 
gather extents from the tree via ioctl ++ */ ++struct ext3_extent_buf { ++ unsigned long start; ++ int buflen; ++ void *buffer; ++ void *cur; ++ int err; ++}; ++ ++/* ++ * this structure is used to collect stats info about the tree ++ */ ++struct ext3_extent_tree_stats { ++ int depth; ++ int extents_num; ++ int leaf_num; ++}; ++ ++extern void ext3_init_tree_desc(struct ext3_extents_tree *, struct inode *); ++extern int ext3_extent_tree_init(handle_t *, struct ext3_extents_tree *); ++extern int ext3_ext_calc_credits_for_insert(struct ext3_extents_tree *, struct ext3_ext_path *); ++extern int ext3_ext_insert_extent(handle_t *, struct ext3_extents_tree *, struct ext3_ext_path *, struct ext3_extent *); ++extern int ext3_ext_walk_space(struct ext3_extents_tree *, unsigned long, unsigned long, ext_prepare_callback); ++extern int ext3_ext_remove_space(struct ext3_extents_tree *, unsigned long, unsigned long); ++extern struct ext3_ext_path * ext3_ext_find_extent(struct ext3_extents_tree *, int, struct ext3_ext_path *); ++extern int ext3_ext_calc_blockmap_metadata(struct inode *, int); ++ ++static inline void ++ext3_ext_invalidate_cache(struct ext3_extents_tree *tree) ++{ ++ if (tree->cex) ++ tree->cex->ec_type = EXT3_EXT_CACHE_NO; ++} ++ ++ ++#endif /* _LINUX_EXT3_EXTENTS */ +Index: linux-2.6.16.21-0.8/include/linux/ext3_fs_i.h +=================================================================== +--- linux-2.6.16.21-0.8.orig/include/linux/ext3_fs_i.h ++++ linux-2.6.16.21-0.8/include/linux/ext3_fs_i.h +@@ -133,6 +133,8 @@ struct ext3_inode_info { + */ + struct semaphore truncate_sem; + struct inode vfs_inode; ++ ++ __u32 i_cached_extent[4]; + }; + + #endif /* _LINUX_EXT3_FS_I */ diff --git a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch new file mode 100644 index 0000000..e89e8e7 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.18-vanilla.patch @@ -0,0 +1,2935 @@ +Index: 
linux-stage/fs/ext3/extents.c
+===================================================================
+--- /dev/null	1970-01-01 00:00:00.000000000 +0000
++++ linux-stage/fs/ext3/extents.c	2006-07-16 14:10:21.000000000 +0800
+@@ -0,0 +1,2347 @@
++/*
++ * Copyright(c) 2003, 2004, 2005, Cluster File Systems, Inc, info@clusterfs.com
++ * Written by Alex Tomas
++ *
++ * This program is free software; you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License version 2 as
++ * published by the Free Software Foundation.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program; if not, write to the Free Software
++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
++ */
++
++/*
++ * Extents support for EXT3
++ *
++ * TODO:
++ *   - ext3_ext_walk_space() should not use ext3_ext_find_extent()
++ *   - ext3_ext_calc_credits() could take 'mergable' into account
++ *   - ext3*_error() should be used in some situations
++ *   - find_goal() [to be tested and improved]
++ *   - smart tree reduction
++ *   - arch-independence
++ *     common on-disk format for big/little-endian arch
++ */
++
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++#include
++
++
++static inline int ext3_ext_check_header(struct ext3_extent_header *eh)
++{
++    if (eh->eh_magic != EXT3_EXT_MAGIC) {
++        printk(KERN_ERR "EXT3-fs: invalid magic = 0x%x\n",
++               (unsigned)eh->eh_magic);
++        return -EIO;
++    }
++    if (eh->eh_max == 0) {
++        printk(KERN_ERR "EXT3-fs: invalid eh_max = %u\n",
++               (unsigned)eh->eh_max);
++        return -EIO;
++    }
++    if (eh->eh_entries > eh->eh_max) {
++        printk(KERN_ERR "EXT3-fs: 
invalid eh_entries = %u\n", ++ (unsigned)eh->eh_entries); ++ return -EIO; ++ } ++ return 0; ++} ++ ++static handle_t *ext3_ext_journal_restart(handle_t *handle, int needed) ++{ ++ int err; ++ ++ if (handle->h_buffer_credits > needed) ++ return handle; ++ if (!ext3_journal_extend(handle, needed)) ++ return handle; ++ err = ext3_journal_restart(handle, needed); ++ ++ return handle; ++} ++ ++static int inline ++ext3_ext_get_access_for_root(handle_t *h, struct ext3_extents_tree *tree) ++{ ++ if (tree->ops->get_write_access) ++ return tree->ops->get_write_access(h,tree->buffer); ++ else ++ return 0; ++} ++ ++static int inline ++ext3_ext_mark_root_dirty(handle_t *h, struct ext3_extents_tree *tree) ++{ ++ if (tree->ops->mark_buffer_dirty) ++ return tree->ops->mark_buffer_dirty(h,tree->buffer); ++ else ++ return 0; ++} ++ ++/* ++ * could return: ++ * - EROFS ++ * - ENOMEM ++ */ ++static int ext3_ext_get_access(handle_t *handle, ++ struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int err; ++ ++ if (path->p_bh) { ++ /* path points to block */ ++ err = ext3_journal_get_write_access(handle, path->p_bh); ++ } else { ++ /* path points to leaf/index in inode body */ ++ err = ext3_ext_get_access_for_root(handle, tree); ++ } ++ return err; ++} ++ ++/* ++ * could return: ++ * - EROFS ++ * - ENOMEM ++ * - EIO ++ */ ++static int ext3_ext_dirty(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int err; ++ if (path->p_bh) { ++ /* path points to block */ ++ err =ext3_journal_dirty_metadata(handle, path->p_bh); ++ } else { ++ /* path points to leaf/index in inode body */ ++ err = ext3_ext_mark_root_dirty(handle, tree); ++ } ++ return err; ++} ++ ++static int inline ++ext3_ext_new_block(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, struct ext3_extent *ex, ++ int *err) ++{ ++ int goal, depth, newblock; ++ struct inode *inode; ++ ++ EXT_ASSERT(tree); ++ if (tree->ops->new_block) ++ return 
tree->ops->new_block(handle, tree, path, ex, err); ++ ++ inode = tree->inode; ++ depth = EXT_DEPTH(tree); ++ if (path && depth > 0) { ++ goal = path[depth-1].p_block; ++ } else { ++ struct ext3_inode_info *ei = EXT3_I(inode); ++ unsigned long bg_start; ++ unsigned long colour; ++ ++ bg_start = (ei->i_block_group * ++ EXT3_BLOCKS_PER_GROUP(inode->i_sb)) + ++ le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block); ++ colour = (current->pid % 16) * ++ (EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16); ++ goal = bg_start + colour; ++ } ++ ++ newblock = ext3_new_block(handle, inode, goal, err); ++ return newblock; ++} ++ ++static inline void ext3_ext_tree_changed(struct ext3_extents_tree *tree) ++{ ++ struct ext3_extent_header *neh; ++ neh = EXT_ROOT_HDR(tree); ++ neh->eh_generation++; ++} ++ ++static inline int ext3_ext_space_block(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->inode->i_sb->s_blocksize - ++ sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent); ++#ifdef AGRESSIVE_TEST ++ size = 6; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_block_idx(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->inode->i_sb->s_blocksize - ++ sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent_idx); ++#ifdef AGRESSIVE_TEST ++ size = 5; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_root(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->buffer_len - sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent); ++#ifdef AGRESSIVE_TEST ++ size = 3; ++#endif ++ return size; ++} ++ ++static inline int ext3_ext_space_root_idx(struct ext3_extents_tree *tree) ++{ ++ int size; ++ ++ size = (tree->buffer_len - sizeof(struct ext3_extent_header)) / ++ sizeof(struct ext3_extent_idx); ++#ifdef AGRESSIVE_TEST ++ size = 4; ++#endif ++ return size; ++} ++ ++static void ext3_ext_show_path(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++#ifdef 
EXT_DEBUG
++	int k, l = path->p_depth;
++
++	ext_debug(tree, "path:");
++	for (k = 0; k <= l; k++, path++) {
++		if (path->p_idx) {
++			ext_debug(tree, " %d->%d", path->p_idx->ei_block,
++				  path->p_idx->ei_leaf);
++		} else if (path->p_ext) {
++			ext_debug(tree, " %d:%d:%d",
++				  path->p_ext->ee_block,
++				  path->p_ext->ee_len,
++				  path->p_ext->ee_start);
++		} else
++			ext_debug(tree, " []");
++	}
++	ext_debug(tree, "\n");
++#endif
++}
++
++static void ext3_ext_show_leaf(struct ext3_extents_tree *tree,
++			       struct ext3_ext_path *path)
++{
++#ifdef EXT_DEBUG
++	int depth = EXT_DEPTH(tree);
++	struct ext3_extent_header *eh;
++	struct ext3_extent *ex;
++	int i;
++
++	if (!path)
++		return;
++
++	eh = path[depth].p_hdr;
++	ex = EXT_FIRST_EXTENT(eh);
++
++	for (i = 0; i < eh->eh_entries; i++, ex++) {
++		ext_debug(tree, "%d:%d:%d ",
++			  ex->ee_block, ex->ee_len, ex->ee_start);
++	}
++	ext_debug(tree, "\n");
++#endif
++}
++
++static void ext3_ext_drop_refs(struct ext3_ext_path *path)
++{
++	int depth = path->p_depth;
++	int i;
++
++	for (i = 0; i <= depth; i++, path++) {
++		if (path->p_bh) {
++			brelse(path->p_bh);
++			path->p_bh = NULL;
++		}
++	}
++}
++
++/*
++ * binary search for closest index by given block
++ */
++static inline void
++ext3_ext_binsearch_idx(struct ext3_extents_tree *tree,
++		       struct ext3_ext_path *path, int block)
++{
++	struct ext3_extent_header *eh = path->p_hdr;
++	struct ext3_extent_idx *ix;
++	int l = 0, k, r;
++
++	EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC);
++	EXT_ASSERT(eh->eh_entries <= eh->eh_max);
++	EXT_ASSERT(eh->eh_entries > 0);
++
++	ext_debug(tree, "binsearch for %d(idx): ", block);
++
++	path->p_idx = ix = EXT_FIRST_INDEX(eh);
++
++	r = k = eh->eh_entries;
++	while (k > 1) {
++		k = (r - l) / 2;
++		if (block < ix[l + k].ei_block)
++			r -= k;
++		else
++			l += k;
++		ext_debug(tree, "%d:%d:%d ", k, l, r);
++	}
++
++	ix += l;
++	path->p_idx = ix;
++	ext_debug(tree," -> %d->%d ",path->p_idx->ei_block,path->p_idx->ei_leaf);
++
++	while (l++ < r) {
++		if (block < ix->ei_block)
++			break;
++		path->p_idx = ix++;
++	}
++	ext_debug(tree, " -> %d->%d\n", path->p_idx->ei_block,
++		  path->p_idx->ei_leaf);
++
++#ifdef CHECK_BINSEARCH
++	{
++		struct ext3_extent_idx *chix;
++
++		chix = ix = EXT_FIRST_INDEX(eh);
++		for (k = 0; k < eh->eh_entries; k++, ix++) {
++			if (k != 0 && ix->ei_block <= ix[-1].ei_block) {
++				printk("k=%d, ix=0x%p, first=0x%p\n", k,
++				       ix, EXT_FIRST_INDEX(eh));
++				printk("%u <= %u\n",
++				       ix->ei_block,ix[-1].ei_block);
++			}
++			EXT_ASSERT(k == 0 || ix->ei_block > ix[-1].ei_block);
++			if (block < ix->ei_block)
++				break;
++			chix = ix;
++		}
++		EXT_ASSERT(chix == path->p_idx);
++	}
++#endif
++}
++
++/*
++ * binary search for closest extent by given block
++ */
++static inline void
++ext3_ext_binsearch(struct ext3_extents_tree *tree,
++		   struct ext3_ext_path *path, int block)
++{
++	struct ext3_extent_header *eh = path->p_hdr;
++	struct ext3_extent *ex;
++	int l = 0, k, r;
++
++	EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC);
++	EXT_ASSERT(eh->eh_entries <= eh->eh_max);
++
++	if (eh->eh_entries == 0) {
++		/*
++		 * this leaf is empty yet:
++		 * we get such a leaf in split/add case
++		 */
++		return;
++	}
++
++	ext_debug(tree, "binsearch for %d: ", block);
++
++	path->p_ext = ex = EXT_FIRST_EXTENT(eh);
++
++	r = k = eh->eh_entries;
++	while (k > 1) {
++		k = (r - l) / 2;
++		if (block < ex[l + k].ee_block)
++			r -= k;
++		else
++			l += k;
++		ext_debug(tree, "%d:%d:%d ", k, l, r);
++	}
++
++	ex += l;
++	path->p_ext = ex;
++	ext_debug(tree, " -> %d:%d:%d ", path->p_ext->ee_block,
++		  path->p_ext->ee_start, path->p_ext->ee_len);
++
++	while (l++ < r) {
++		if (block < ex->ee_block)
++			break;
++		path->p_ext = ex++;
++	}
++	ext_debug(tree, " -> %d:%d:%d\n", path->p_ext->ee_block,
++		  path->p_ext->ee_start, path->p_ext->ee_len);
++
++#ifdef CHECK_BINSEARCH
++	{
++		struct ext3_extent *chex;
++
++		chex = ex = EXT_FIRST_EXTENT(eh);
++		for (k = 0; k < eh->eh_entries; k++, ex++) {
++			EXT_ASSERT(k == 0 || ex->ee_block > ex[-1].ee_block);
++			if (block < ex->ee_block)
++				break;
++			chex = ex;
++		}
++		EXT_ASSERT(chex == path->p_ext);
++	}
++#endif
++}
++
++int ext3_extent_tree_init(handle_t *handle, struct ext3_extents_tree *tree)
++{
++	struct ext3_extent_header *eh;
++
++	BUG_ON(tree->buffer_len == 0);
++	ext3_ext_get_access_for_root(handle, tree);
++	eh = EXT_ROOT_HDR(tree);
++	eh->eh_depth = 0;
++	eh->eh_entries = 0;
++	eh->eh_magic = EXT3_EXT_MAGIC;
++	eh->eh_max = ext3_ext_space_root(tree);
++	ext3_ext_mark_root_dirty(handle, tree);
++	ext3_ext_invalidate_cache(tree);
++	return 0;
++}
++
++struct ext3_ext_path *
++ext3_ext_find_extent(struct ext3_extents_tree *tree, int block,
++		     struct ext3_ext_path *path)
++{
++	struct ext3_extent_header *eh;
++	struct buffer_head *bh;
++	int depth, i, ppos = 0;
++
++	EXT_ASSERT(tree);
++	EXT_ASSERT(tree->inode);
++	EXT_ASSERT(tree->root);
++
++	eh = EXT_ROOT_HDR(tree);
++	EXT_ASSERT(eh);
++	if (ext3_ext_check_header(eh))
++		goto err;
++
++	i = depth = EXT_DEPTH(tree);
++	EXT_ASSERT(eh->eh_max);
++	EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC);
++
++	/* account possible depth increase */
++	if (!path) {
++		path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 2),
++			       GFP_NOFS);
++		if (!path)
++			return ERR_PTR(-ENOMEM);
++	}
++	memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1));
++	path[0].p_hdr = eh;
++
++	/* walk through the tree */
++	while (i) {
++		ext_debug(tree, "depth %d: num %d, max %d\n",
++			  ppos, eh->eh_entries, eh->eh_max);
++		ext3_ext_binsearch_idx(tree, path + ppos, block);
++		path[ppos].p_block = path[ppos].p_idx->ei_leaf;
++		path[ppos].p_depth = i;
++		path[ppos].p_ext = NULL;
++
++		bh = sb_bread(tree->inode->i_sb, path[ppos].p_block);
++		if (!bh)
++			goto err;
++
++		eh = EXT_BLOCK_HDR(bh);
++		ppos++;
++		EXT_ASSERT(ppos <= depth);
++		path[ppos].p_bh = bh;
++		path[ppos].p_hdr = eh;
++		i--;
++
++		if (ext3_ext_check_header(eh))
++			goto err;
++	}
++
++	path[ppos].p_depth = i;
++	path[ppos].p_hdr = eh;
++	path[ppos].p_ext = NULL;
++	path[ppos].p_idx = NULL;
++
++	if (ext3_ext_check_header(eh))
++		goto err;
++
++	/* find extent */
++	ext3_ext_binsearch(tree, path + ppos, block);
++
++	ext3_ext_show_path(tree, path);
++
++	return path;
++
++err:
++	printk(KERN_ERR "EXT3-fs: header is corrupted!\n");
++	ext3_ext_drop_refs(path);
++	kfree(path);
++	return ERR_PTR(-EIO);
++}
++
++/*
++ * insert new index [logical;ptr] into the index block at curp;
++ * it checks where to insert: before curp or after curp
++ */
++static int ext3_ext_insert_index(handle_t *handle,
++				 struct ext3_extents_tree *tree,
++				 struct ext3_ext_path *curp,
++				 int logical, int ptr)
++{
++	struct ext3_extent_idx *ix;
++	int len, err;
++
++	if ((err = ext3_ext_get_access(handle, tree, curp)))
++		return err;
++
++	EXT_ASSERT(logical != curp->p_idx->ei_block);
++	len = EXT_MAX_INDEX(curp->p_hdr) - curp->p_idx;
++	if (logical > curp->p_idx->ei_block) {
++		/* insert after */
++		if (curp->p_idx != EXT_LAST_INDEX(curp->p_hdr)) {
++			len = (len - 1) * sizeof(struct ext3_extent_idx);
++			len = len < 0 ? 0 : len;
++			ext_debug(tree, "insert new index %d after: %d. "
++				  "move %d from 0x%p to 0x%p\n",
++				  logical, ptr, len,
++				  (curp->p_idx + 1), (curp->p_idx + 2));
++			memmove(curp->p_idx + 2, curp->p_idx + 1, len);
++		}
++		ix = curp->p_idx + 1;
++	} else {
++		/* insert before */
++		len = len * sizeof(struct ext3_extent_idx);
++		len = len < 0 ? 0 : len;
++		ext_debug(tree, "insert new index %d before: %d. "
++			  "move %d from 0x%p to 0x%p\n",
++			  logical, ptr, len,
++			  curp->p_idx, (curp->p_idx + 1));
++		memmove(curp->p_idx + 1, curp->p_idx, len);
++		ix = curp->p_idx;
++	}
++
++	ix->ei_block = logical;
++	ix->ei_leaf = ptr;
++	curp->p_hdr->eh_entries++;
++
++	EXT_ASSERT(curp->p_hdr->eh_entries <= curp->p_hdr->eh_max);
++	EXT_ASSERT(ix <= EXT_LAST_INDEX(curp->p_hdr));
++
++	err = ext3_ext_dirty(handle, tree, curp);
++	ext3_std_error(tree->inode->i_sb, err);
++
++	return err;
++}
++
++/*
++ * routine inserts new subtree into the path, using free index entry
++ * at depth 'at':
++ * - allocates all needed blocks (new leaf and all intermediate index blocks)
++ * - makes decision where to split
++ * - moves remaining extents and index entries (right of the split point)
++ *   into the newly allocated blocks
++ * - initializes the subtree
++ */
++static int ext3_ext_split(handle_t *handle, struct ext3_extents_tree *tree,
++			  struct ext3_ext_path *path,
++			  struct ext3_extent *newext, int at)
++{
++	struct buffer_head *bh = NULL;
++	int depth = EXT_DEPTH(tree);
++	struct ext3_extent_header *neh;
++	struct ext3_extent_idx *fidx;
++	struct ext3_extent *ex;
++	int i = at, k, m, a;
++	unsigned long newblock, oldblock, border;
++	int *ablocks = NULL; /* array of allocated blocks */
++	int err = 0;
++
++	/* make decision: where to split? */
++	/* FIXME: for now the decision is the simplest: at current extent */
++
++	/* if the current leaf will be split, then we should use
++	 * the border from the split point */
++	EXT_ASSERT(path[depth].p_ext <= EXT_MAX_EXTENT(path[depth].p_hdr));
++	if (path[depth].p_ext != EXT_MAX_EXTENT(path[depth].p_hdr)) {
++		border = path[depth].p_ext[1].ee_block;
++		ext_debug(tree, "leaf will be split."
++			  " next leaf starts at %d\n",
++			  (int)border);
++	} else {
++		border = newext->ee_block;
++		ext_debug(tree, "leaf will be added."
++			  " next leaf starts at %d\n",
++			  (int)border);
++	}
++
++	/*
++	 * if an error occurs, we break processing
++	 * and turn the filesystem read-only. so, the index won't
++	 * be inserted and the tree will be in a consistent
++	 * state. the next mount will repair buffers too
++	 */
++
++	/*
++	 * get an array to track all allocated blocks;
++	 * we need this to handle errors and free blocks
++	 * upon them
++	 */
++	ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS);
++	if (!ablocks)
++		return -ENOMEM;
++	memset(ablocks, 0, sizeof(unsigned long) * depth);
++
++	/* allocate all needed blocks */
++	ext_debug(tree, "allocate %d blocks for indexes/leaf\n", depth - at);
++	for (a = 0; a < depth - at; a++) {
++		newblock = ext3_ext_new_block(handle, tree, path, newext, &err);
++		if (newblock == 0)
++			goto cleanup;
++		ablocks[a] = newblock;
++	}
++
++	/* initialize new leaf */
++	newblock = ablocks[--a];
++	EXT_ASSERT(newblock);
++	bh = sb_getblk(tree->inode->i_sb, newblock);
++	if (!bh) {
++		err = -EIO;
++		goto cleanup;
++	}
++	lock_buffer(bh);
++
++	if ((err = ext3_journal_get_create_access(handle, bh)))
++		goto cleanup;
++
++	neh = EXT_BLOCK_HDR(bh);
++	neh->eh_entries = 0;
++	neh->eh_max = ext3_ext_space_block(tree);
++	neh->eh_magic = EXT3_EXT_MAGIC;
++	neh->eh_depth = 0;
++	ex = EXT_FIRST_EXTENT(neh);
++
++	/* move the remainder of path[depth] to the new leaf */
++	EXT_ASSERT(path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max);
++	/* start copy from next extent */
++	/* TODO: we could do it with a single memmove */
++	m = 0;
++	path[depth].p_ext++;
++	while (path[depth].p_ext <=
++	       EXT_MAX_EXTENT(path[depth].p_hdr)) {
++		ext_debug(tree, "move %d:%d:%d in new leaf %lu\n",
++			  path[depth].p_ext->ee_block,
++			  path[depth].p_ext->ee_start,
++			  path[depth].p_ext->ee_len,
++			  newblock);
++		memmove(ex++, path[depth].p_ext++, sizeof(struct ext3_extent));
++		neh->eh_entries++;
++		m++;
++	}
++	set_buffer_uptodate(bh);
++	unlock_buffer(bh);
++
++	if ((err = ext3_journal_dirty_metadata(handle, bh)))
++		goto cleanup;
++	brelse(bh);
++	bh = NULL;
++
++	/* correct old leaf */
++	if (m) {
++		if ((err = ext3_ext_get_access(handle, tree, path + depth)))
++			goto cleanup;
++		path[depth].p_hdr->eh_entries -= m;
++		if ((err = ext3_ext_dirty(handle, tree, path + depth)))
++			goto cleanup;
++
++	}
++
++	/* create intermediate indexes */
++	k = depth - at - 1;
++	EXT_ASSERT(k >= 0);
++	if (k)
++		ext_debug(tree, "create %d intermediate indices\n", k);
++	/* insert new index into current index block */
++	/* current depth stored in i var */
++	i = depth - 1;
++	while (k--) {
++		oldblock = newblock;
++		newblock = ablocks[--a];
++		bh = sb_getblk(tree->inode->i_sb, newblock);
++		if (!bh) {
++			err = -EIO;
++			goto cleanup;
++		}
++		lock_buffer(bh);
++
++		if ((err = ext3_journal_get_create_access(handle, bh)))
++			goto cleanup;
++
++		neh = EXT_BLOCK_HDR(bh);
++		neh->eh_entries = 1;
++		neh->eh_magic = EXT3_EXT_MAGIC;
++		neh->eh_max = ext3_ext_space_block_idx(tree);
++		neh->eh_depth = depth - i;
++		fidx = EXT_FIRST_INDEX(neh);
++		fidx->ei_block = border;
++		fidx->ei_leaf = oldblock;
++
++		ext_debug(tree, "int.index at %d (block %lu): %lu -> %lu\n",
++			  i, newblock, border, oldblock);
++		/* copy indexes */
++		m = 0;
++		path[i].p_idx++;
++
++		ext_debug(tree, "cur 0x%p, last 0x%p\n", path[i].p_idx,
++			  EXT_MAX_INDEX(path[i].p_hdr));
++		EXT_ASSERT(EXT_MAX_INDEX(path[i].p_hdr) ==
++			   EXT_LAST_INDEX(path[i].p_hdr));
++		while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
++			ext_debug(tree, "%d: move %d:%d in new index %lu\n",
++				  i, path[i].p_idx->ei_block,
++				  path[i].p_idx->ei_leaf, newblock);
++			memmove(++fidx, path[i].p_idx++,
++				sizeof(struct ext3_extent_idx));
++			neh->eh_entries++;
++			EXT_ASSERT(neh->eh_entries <= neh->eh_max);
++			m++;
++		}
++		set_buffer_uptodate(bh);
++		unlock_buffer(bh);
++
++		if ((err = ext3_journal_dirty_metadata(handle, bh)))
++			goto cleanup;
++		brelse(bh);
++		bh = NULL;
++
++		/* correct old index */
++		if (m) {
++			err = ext3_ext_get_access(handle, tree, path + i);
++			if (err)
++				goto cleanup;
++			path[i].p_hdr->eh_entries -= m;
++			err = ext3_ext_dirty(handle, tree, path + i);
++			if (err)
++				goto cleanup;
++		}
++
++		i--;
++	}
++
++	/* insert new index */
++	if (!err)
++		err = ext3_ext_insert_index(handle, tree, path + at,
++					    border, newblock);
++
++cleanup:
++	if (bh) {
++		if (buffer_locked(bh))
++			unlock_buffer(bh);
++		brelse(bh);
++	}
++
++	if (err) {
++		/* free all allocated blocks in error case */
++		for (i = 0; i < depth; i++) {
++			if (!ablocks[i])
++				continue;
++			ext3_free_blocks(handle, tree->inode, ablocks[i], 1);
++		}
++	}
++	kfree(ablocks);
++
++	return err;
++}
++
++/*
++ * routine implements tree growing procedure:
++ * - allocates new block
++ * - moves top-level data (index block or leaf) into the new block
++ * - initializes new top-level, creating an index that points to the
++ *   just created block
++ */
++static int ext3_ext_grow_indepth(handle_t *handle,
++				 struct ext3_extents_tree *tree,
++				 struct ext3_ext_path *path,
++				 struct ext3_extent *newext)
++{
++	struct ext3_ext_path *curp = path;
++	struct ext3_extent_header *neh;
++	struct ext3_extent_idx *fidx;
++	struct buffer_head *bh;
++	unsigned long newblock;
++	int err = 0;
++
++	newblock = ext3_ext_new_block(handle, tree, path, newext, &err);
++	if (newblock == 0)
++		return err;
++
++	bh = sb_getblk(tree->inode->i_sb, newblock);
++	if (!bh) {
++		err = -EIO;
++		ext3_std_error(tree->inode->i_sb, err);
++		return err;
++	}
++	lock_buffer(bh);
++
++	if ((err = ext3_journal_get_create_access(handle, bh))) {
++		unlock_buffer(bh);
++		goto out;
++	}
++
++	/* move top-level index/leaf into new block */
++	memmove(bh->b_data, curp->p_hdr, tree->buffer_len);
++
++	/* set size of new block */
++	neh = EXT_BLOCK_HDR(bh);
++	/* old root could have indexes or leaves
++	 * so calculate eh_max the right way */
++	if (EXT_DEPTH(tree))
++		neh->eh_max = ext3_ext_space_block_idx(tree);
++	else
++		neh->eh_max = ext3_ext_space_block(tree);
++	neh->eh_magic = EXT3_EXT_MAGIC;
++	set_buffer_uptodate(bh);
++	unlock_buffer(bh);
++
++	if ((err = ext3_journal_dirty_metadata(handle, bh)))
++		goto out;
++
++	/* create index in new top-level index: num,max,pointer */
++	if ((err = ext3_ext_get_access(handle, tree, curp)))
++		goto out;
++
++	curp->p_hdr->eh_magic = EXT3_EXT_MAGIC;
++	curp->p_hdr->eh_max = ext3_ext_space_root_idx(tree);
++	curp->p_hdr->eh_entries = 1;
++	curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
++	/* FIXME: it works, but actually path[0] can be index */
++	curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
++	curp->p_idx->ei_leaf = newblock;
++
++	neh = EXT_ROOT_HDR(tree);
++	fidx = EXT_FIRST_INDEX(neh);
++	ext_debug(tree, "new root: num %d(%d), lblock %d, ptr %d\n",
++		  neh->eh_entries, neh->eh_max, fidx->ei_block, fidx->ei_leaf);
++
++	neh->eh_depth = path->p_depth + 1;
++	err = ext3_ext_dirty(handle, tree, curp);
++out:
++	brelse(bh);
++
++	return err;
++}
++
++/*
++ * routine finds an empty index and adds a new leaf. if no free index
++ * is found, then it requests growing in depth
++ */
++static int ext3_ext_create_new_leaf(handle_t *handle,
++				    struct ext3_extents_tree *tree,
++				    struct ext3_ext_path *path,
++				    struct ext3_extent *newext)
++{
++	struct ext3_ext_path *curp;
++	int depth, i, err = 0;
++
++repeat:
++	i = depth = EXT_DEPTH(tree);
++
++	/* walk up the tree looking for a free index entry */
++	curp = path + depth;
++	while (i > 0 && !EXT_HAS_FREE_INDEX(curp)) {
++		i--;
++		curp--;
++	}
++
++	/* we use an already allocated block for the index block,
++	 * so subsequent data blocks should be contiguous */
++	if (EXT_HAS_FREE_INDEX(curp)) {
++		/* if we found an index with a free entry, then use that
++		 * entry: create all needed subtree and add new leaf */
++		err = ext3_ext_split(handle, tree, path, newext, i);
++
++		/* refill path */
++		ext3_ext_drop_refs(path);
++		path = ext3_ext_find_extent(tree, newext->ee_block, path);
++		if (IS_ERR(path))
++			err = PTR_ERR(path);
++	} else {
++		/* tree is full, time to grow in depth */
++		err = ext3_ext_grow_indepth(handle, tree, path, newext);
++
++		/* refill path */
++		ext3_ext_drop_refs(path);
++		path = ext3_ext_find_extent(tree, newext->ee_block, path);
++		if (IS_ERR(path))
++			err = PTR_ERR(path);
++
++		/*
++		 * only the first split (depth 0 -> 1) produces free space;
++		 * in all other cases we have to split the grown tree
++		 */
++		depth = EXT_DEPTH(tree);
++		if (path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max) {
++			/* now we need split */
++			goto repeat;
++		}
++	}
++
++	if (err)
++		return err;
++
++	return 0;
++}
++
++/*
++ * returns allocated block in subsequent extent or EXT_MAX_BLOCK
++ * NOTE: it considers the block number from an index entry as an
++ * allocated block. thus, index entries have to be consistent
++ * with leaves
++ */
++static unsigned long
++ext3_ext_next_allocated_block(struct ext3_ext_path *path)
++{
++	int depth;
++
++	EXT_ASSERT(path != NULL);
++	depth = path->p_depth;
++
++	if (depth == 0 && path->p_ext == NULL)
++		return EXT_MAX_BLOCK;
++
++	/* FIXME: what if index isn't full ?! */
++	while (depth >= 0) {
++		if (depth == path->p_depth) {
++			/* leaf */
++			if (path[depth].p_ext !=
++			    EXT_LAST_EXTENT(path[depth].p_hdr))
++				return path[depth].p_ext[1].ee_block;
++		} else {
++			/* index */
++			if (path[depth].p_idx !=
++			    EXT_LAST_INDEX(path[depth].p_hdr))
++				return path[depth].p_idx[1].ei_block;
++		}
++		depth--;
++	}
++
++	return EXT_MAX_BLOCK;
++}
++
++/*
++ * returns first allocated block from next leaf or EXT_MAX_BLOCK
++ */
++static unsigned ext3_ext_next_leaf_block(struct ext3_extents_tree *tree,
++					 struct ext3_ext_path *path)
++{
++	int depth;
++
++	EXT_ASSERT(path != NULL);
++	depth = path->p_depth;
++
++	/* zero-depth tree has no leaf blocks at all */
++	if (depth == 0)
++		return EXT_MAX_BLOCK;
++
++	/* go to index block */
++	depth--;
++
++	while (depth >= 0) {
++		if (path[depth].p_idx !=
++		    EXT_LAST_INDEX(path[depth].p_hdr))
++			return path[depth].p_idx[1].ei_block;
++		depth--;
++	}
++
++	return EXT_MAX_BLOCK;
++}
++
++/*
++ * if the leaf gets modified and the modified extent is first in the leaf,
++ * then we have to correct all indexes above.
++ * TODO: do we need to correct the tree in all cases?
++ */
++int ext3_ext_correct_indexes(handle_t *handle, struct ext3_extents_tree *tree,
++			     struct ext3_ext_path *path)
++{
++	struct ext3_extent_header *eh;
++	int depth = EXT_DEPTH(tree);
++	struct ext3_extent *ex;
++	unsigned long border;
++	int k, err = 0;
++
++	eh = path[depth].p_hdr;
++	ex = path[depth].p_ext;
++	EXT_ASSERT(ex);
++	EXT_ASSERT(eh);
++
++	if (depth == 0) {
++		/* there is no tree at all */
++		return 0;
++	}
++
++	if (ex != EXT_FIRST_EXTENT(eh)) {
++		/* we correct the tree only if the first extent in the leaf got modified */
++		return 0;
++	}
++
++	/*
++	 * TODO: we need a correction if the border is smaller than the current one
++	 */
++	k = depth - 1;
++	border = path[depth].p_ext->ee_block;
++	if ((err = ext3_ext_get_access(handle, tree, path + k)))
++		return err;
++	path[k].p_idx->ei_block = border;
++	if ((err = ext3_ext_dirty(handle, tree, path + k)))
++		return err;
++
++	while (k--) {
++		/* change all left-side indexes */
++		if (path[k+1].p_idx != EXT_FIRST_INDEX(path[k+1].p_hdr))
++			break;
++		if ((err = ext3_ext_get_access(handle, tree, path + k)))
++			break;
++		path[k].p_idx->ei_block = border;
++		if ((err = ext3_ext_dirty(handle, tree, path + k)))
++			break;
++	}
++
++	return err;
++}
++
++static int inline
++ext3_can_extents_be_merged(struct ext3_extents_tree *tree,
++			   struct ext3_extent *ex1,
++			   struct ext3_extent *ex2)
++{
++	if (ex1->ee_block + ex1->ee_len != ex2->ee_block)
++		return 0;
++
++#ifdef AGRESSIVE_TEST
++	if (ex1->ee_len >= 4)
++		return 0;
++#endif
++
++	if (!tree->ops->mergable)
++		return 1;
++
++	return tree->ops->mergable(ex1, ex2);
++}
++
++/*
++ * this routine tries to merge the requested extent into an existing
++ * extent or inserts the requested extent as a new one into the tree,
++ * creating a new leaf in the no-space case
++ */
++int ext3_ext_insert_extent(handle_t *handle, struct ext3_extents_tree *tree,
++			   struct ext3_ext_path *path,
++			   struct ext3_extent *newext)
++{
++	struct ext3_extent_header * eh;
++	struct ext3_extent *ex, *fex;
++	struct ext3_extent *nearex; /* nearest extent */
++	struct ext3_ext_path *npath = NULL;
++	int depth, len, err, next;
++
++	EXT_ASSERT(newext->ee_len > 0);
++	depth = EXT_DEPTH(tree);
++	ex = path[depth].p_ext;
++	EXT_ASSERT(path[depth].p_hdr);
++
++	/* try to insert block into found extent and return */
++	if (ex && ext3_can_extents_be_merged(tree, ex, newext)) {
++		ext_debug(tree, "append %d block to %d:%d (from %d)\n",
++			  newext->ee_len, ex->ee_block, ex->ee_len,
++			  ex->ee_start);
++		if ((err = ext3_ext_get_access(handle, tree, path + depth)))
++			return err;
++		ex->ee_len += newext->ee_len;
++		eh = path[depth].p_hdr;
++		nearex = ex;
++		goto merge;
++	}
++
++repeat:
++	depth = EXT_DEPTH(tree);
++	eh = path[depth].p_hdr;
++	if (eh->eh_entries < eh->eh_max)
++		goto has_space;
++
++	/* probably next leaf has space for us? */
++	fex = EXT_LAST_EXTENT(eh);
++	next = ext3_ext_next_leaf_block(tree, path);
++	if (newext->ee_block > fex->ee_block && next != EXT_MAX_BLOCK) {
++		ext_debug(tree, "next leaf block - %d\n", next);
++		EXT_ASSERT(!npath);
++		npath = ext3_ext_find_extent(tree, next, NULL);
++		if (IS_ERR(npath))
++			return PTR_ERR(npath);
++		EXT_ASSERT(npath->p_depth == path->p_depth);
++		eh = npath[depth].p_hdr;
++		if (eh->eh_entries < eh->eh_max) {
++			ext_debug(tree, "next leaf isn't full(%d)\n",
++				  eh->eh_entries);
++			path = npath;
++			goto repeat;
++		}
++		ext_debug(tree, "next leaf has no free space(%d,%d)\n",
++			  eh->eh_entries, eh->eh_max);
++	}
++
++	/*
++	 * there is no free space in the found leaf,
++	 * so we're going to add a new leaf to the tree
++	 */
++	err = ext3_ext_create_new_leaf(handle, tree, path, newext);
++	if (err)
++		goto cleanup;
++	depth = EXT_DEPTH(tree);
++	eh = path[depth].p_hdr;
++
++has_space:
++	nearex = path[depth].p_ext;
++
++	if ((err = ext3_ext_get_access(handle, tree, path + depth)))
++		goto cleanup;
++
++	if (!nearex) {
++		/* there is no extent in this leaf, create first one */
++		ext_debug(tree, "first extent in the leaf: %d:%d:%d\n",
++			  newext->ee_block, newext->ee_start,
++			  newext->ee_len);
++		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
++	} else if (newext->ee_block > nearex->ee_block) {
++		EXT_ASSERT(newext->ee_block != nearex->ee_block);
++		if (nearex != EXT_LAST_EXTENT(eh)) {
++			len = EXT_MAX_EXTENT(eh) - nearex;
++			len = (len - 1) * sizeof(struct ext3_extent);
++			len = len < 0 ? 0 : len;
++			ext_debug(tree, "insert %d:%d:%d after: nearest 0x%p, "
++				  "move %d from 0x%p to 0x%p\n",
++				  newext->ee_block, newext->ee_start,
++				  newext->ee_len,
++				  nearex, len, nearex + 1, nearex + 2);
++			memmove(nearex + 2, nearex + 1, len);
++		}
++		path[depth].p_ext = nearex + 1;
++	} else {
++		EXT_ASSERT(newext->ee_block != nearex->ee_block);
++		len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
++		len = len < 0 ? 0 : len;
++		ext_debug(tree, "insert %d:%d:%d before: nearest 0x%p, "
++			  "move %d from 0x%p to 0x%p\n",
++			  newext->ee_block, newext->ee_start, newext->ee_len,
++			  nearex, len, nearex + 1, nearex + 2);
++		memmove(nearex + 1, nearex, len);
++		path[depth].p_ext = nearex;
++	}
++
++	eh->eh_entries++;
++	nearex = path[depth].p_ext;
++	nearex->ee_block = newext->ee_block;
++	nearex->ee_start = newext->ee_start;
++	nearex->ee_len = newext->ee_len;
++	/* FIXME: support for large fs */
++	nearex->ee_start_hi = 0;
++
++merge:
++	/* try to merge extents to the right */
++	while (nearex < EXT_LAST_EXTENT(eh)) {
++		if (!ext3_can_extents_be_merged(tree, nearex, nearex + 1))
++			break;
++		/* merge with next extent! */
++		nearex->ee_len += nearex[1].ee_len;
++		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
++			len = (EXT_LAST_EXTENT(eh) - nearex - 1) *
++			      sizeof(struct ext3_extent);
++			memmove(nearex + 1, nearex + 2, len);
++		}
++		eh->eh_entries--;
++		EXT_ASSERT(eh->eh_entries > 0);
++	}
++
++	/* try to merge extents to the left */
++
++	/* time to correct all indexes above */
++	err = ext3_ext_correct_indexes(handle, tree, path);
++	if (err)
++		goto cleanup;
++
++	err = ext3_ext_dirty(handle, tree, path + depth);
++
++cleanup:
++	if (npath) {
++		ext3_ext_drop_refs(npath);
++		kfree(npath);
++	}
++	ext3_ext_tree_changed(tree);
++	ext3_ext_invalidate_cache(tree);
++	return err;
++}
++
++int ext3_ext_walk_space(struct ext3_extents_tree *tree, unsigned long block,
++			unsigned long num, ext_prepare_callback func)
++{
++	struct ext3_ext_path *path = NULL;
++	struct ext3_ext_cache cbex;
++	struct ext3_extent *ex;
++	unsigned long next, start = 0, end = 0;
++	unsigned long last = block + num;
++	int depth, exists, err = 0;
++
++	EXT_ASSERT(tree);
++	EXT_ASSERT(func);
++	EXT_ASSERT(tree->inode);
++	EXT_ASSERT(tree->root);
++
++	while (block < last && block != EXT_MAX_BLOCK) {
++		num = last - block;
++		/* find extent for this block */
++		path = ext3_ext_find_extent(tree, block, path);
++		if (IS_ERR(path)) {
++			err = PTR_ERR(path);
++			path = NULL;
++			break;
++		}
++
++		depth = EXT_DEPTH(tree);
++		EXT_ASSERT(path[depth].p_hdr);
++		ex = path[depth].p_ext;
++		next = ext3_ext_next_allocated_block(path);
++
++		exists = 0;
++		if (!ex) {
++			/* there is no extent yet, so try to allocate
++			 * all requested space */
++			start = block;
++			end = block + num;
++		} else if (ex->ee_block > block) {
++			/* need to allocate space before found extent */
++			start = block;
++			end = ex->ee_block;
++			if (block + num < end)
++				end = block + num;
++		} else if (block >= ex->ee_block + ex->ee_len) {
++			/* need to allocate space after found extent */
++			start = block;
++			end = block + num;
++			if (end >= next)
++				end = next;
++ } else if (block >= ex->ee_block) { ++ /* ++ * some part of requested space is covered ++ * by found extent ++ */ ++ start = block; ++ end = ex->ee_block + ex->ee_len; ++ if (block + num < end) ++ end = block + num; ++ exists = 1; ++ } else { ++ BUG(); ++ } ++ EXT_ASSERT(end > start); ++ ++ if (!exists) { ++ cbex.ec_block = start; ++ cbex.ec_len = end - start; ++ cbex.ec_start = 0; ++ cbex.ec_type = EXT3_EXT_CACHE_GAP; ++ } else { ++ cbex.ec_block = ex->ee_block; ++ cbex.ec_len = ex->ee_len; ++ cbex.ec_start = ex->ee_start; ++ cbex.ec_type = EXT3_EXT_CACHE_EXTENT; ++ } ++ ++ EXT_ASSERT(cbex.ec_len > 0); ++ EXT_ASSERT(path[depth].p_hdr); ++ err = func(tree, path, &cbex); ++ ext3_ext_drop_refs(path); ++ ++ if (err < 0) ++ break; ++ if (err == EXT_REPEAT) ++ continue; ++ else if (err == EXT_BREAK) { ++ err = 0; ++ break; ++ } ++ ++ if (EXT_DEPTH(tree) != depth) { ++ /* depth was changed. we have to realloc path */ ++ kfree(path); ++ path = NULL; ++ } ++ ++ block = cbex.ec_block + cbex.ec_len; ++ } ++ ++ if (path) { ++ ext3_ext_drop_refs(path); ++ kfree(path); ++ } ++ ++ return err; ++} ++ ++static inline void ++ext3_ext_put_in_cache(struct ext3_extents_tree *tree, __u32 block, ++ __u32 len, __u32 start, int type) ++{ ++ EXT_ASSERT(len > 0); ++ if (tree->cex) { ++ tree->cex->ec_type = type; ++ tree->cex->ec_block = block; ++ tree->cex->ec_len = len; ++ tree->cex->ec_start = start; ++ } ++} ++ ++/* ++ * this routine calculate boundaries of the gap requested block fits into ++ * and cache this gap ++ */ ++static inline void ++ext3_ext_put_gap_in_cache(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ unsigned long block) ++{ ++ int depth = EXT_DEPTH(tree); ++ unsigned long lblock, len; ++ struct ext3_extent *ex; ++ ++ if (!tree->cex) ++ return; ++ ++ ex = path[depth].p_ext; ++ if (ex == NULL) { ++ /* there is no extent yet, so gap is [0;-] */ ++ lblock = 0; ++ len = EXT_MAX_BLOCK; ++ ext_debug(tree, "cache gap(whole file):"); ++ } else if (block < 
ex->ee_block) { ++ lblock = block; ++ len = ex->ee_block - block; ++ ext_debug(tree, "cache gap(before): %lu [%lu:%lu]", ++ (unsigned long) block, ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len); ++ } else if (block >= ex->ee_block + ex->ee_len) { ++ lblock = ex->ee_block + ex->ee_len; ++ len = ext3_ext_next_allocated_block(path); ++ ext_debug(tree, "cache gap(after): [%lu:%lu] %lu", ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len, ++ (unsigned long) block); ++ EXT_ASSERT(len > lblock); ++ len = len - lblock; ++ } else { ++ lblock = len = 0; ++ BUG(); ++ } ++ ++ ext_debug(tree, " -> %lu:%lu\n", (unsigned long) lblock, len); ++ ext3_ext_put_in_cache(tree, lblock, len, 0, EXT3_EXT_CACHE_GAP); ++} ++ ++static inline int ++ext3_ext_in_cache(struct ext3_extents_tree *tree, unsigned long block, ++ struct ext3_extent *ex) ++{ ++ struct ext3_ext_cache *cex = tree->cex; ++ ++ /* is there cache storage at all? */ ++ if (!cex) ++ return EXT3_EXT_CACHE_NO; ++ ++ /* has cache valid data? */ ++ if (cex->ec_type == EXT3_EXT_CACHE_NO) ++ return EXT3_EXT_CACHE_NO; ++ ++ EXT_ASSERT(cex->ec_type == EXT3_EXT_CACHE_GAP || ++ cex->ec_type == EXT3_EXT_CACHE_EXTENT); ++ if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) { ++ ex->ee_block = cex->ec_block; ++ ex->ee_start = cex->ec_start; ++ ex->ee_len = cex->ec_len; ++ ext_debug(tree, "%lu cached by %lu:%lu:%lu\n", ++ (unsigned long) block, ++ (unsigned long) ex->ee_block, ++ (unsigned long) ex->ee_len, ++ (unsigned long) ex->ee_start); ++ return cex->ec_type; ++ } ++ ++ /* not in cache */ ++ return EXT3_EXT_CACHE_NO; ++} ++ ++/* ++ * routine removes index from the index block ++ * it's used in truncate case only. 
thus all requests are for ++ * last index in the block only ++ */ ++int ext3_ext_rm_idx(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ struct buffer_head *bh; ++ int err; ++ ++ /* free index block */ ++ path--; ++ EXT_ASSERT(path->p_hdr->eh_entries); ++ if ((err = ext3_ext_get_access(handle, tree, path))) ++ return err; ++ path->p_hdr->eh_entries--; ++ if ((err = ext3_ext_dirty(handle, tree, path))) ++ return err; ++ ext_debug(tree, "index is empty, remove it, free block %d\n", ++ path->p_idx->ei_leaf); ++ bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); ++ ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); ++ ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1); ++ return err; ++} ++ ++int ext3_ext_calc_credits_for_insert(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path) ++{ ++ int depth = EXT_DEPTH(tree); ++ int needed; ++ ++ if (path) { ++ /* probably there is space in leaf? */ ++ if (path[depth].p_hdr->eh_entries < path[depth].p_hdr->eh_max) ++ return 1; ++ } ++ ++ /* ++ * the worste case we're expecting is creation of the ++ * new root (growing in depth) with index splitting ++ * for splitting we have to consider depth + 1 because ++ * previous growing could increase it ++ */ ++ depth = depth + 1; ++ ++ /* ++ * growing in depth: ++ * block allocation + new root + old root ++ */ ++ needed = EXT3_ALLOC_NEEDED + 2; ++ ++ /* index split. 
we may need: ++ * allocate intermediate indexes and new leaf ++ * change two blocks at each level, but root ++ * modify root block (inode) ++ */ ++ needed += (depth * EXT3_ALLOC_NEEDED) + (2 * depth) + 1; ++ ++ return needed; ++} ++ ++static int ++ext3_ext_split_for_rm(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, unsigned long start, ++ unsigned long end) ++{ ++ struct ext3_extent *ex, tex; ++ struct ext3_ext_path *npath; ++ int depth, creds, err; ++ ++ depth = EXT_DEPTH(tree); ++ ex = path[depth].p_ext; ++ EXT_ASSERT(ex); ++ EXT_ASSERT(end < ex->ee_block + ex->ee_len - 1); ++ EXT_ASSERT(ex->ee_block < start); ++ ++ /* calculate tail extent */ ++ tex.ee_block = end + 1; ++ EXT_ASSERT(tex.ee_block < ex->ee_block + ex->ee_len); ++ tex.ee_len = ex->ee_block + ex->ee_len - tex.ee_block; ++ ++ creds = ext3_ext_calc_credits_for_insert(tree, path); ++ handle = ext3_ext_journal_restart(handle, creds); ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ ++ /* calculate head extent. use primary extent */ ++ err = ext3_ext_get_access(handle, tree, path + depth); ++ if (err) ++ return err; ++ ex->ee_len = start - ex->ee_block; ++ err = ext3_ext_dirty(handle, tree, path + depth); ++ if (err) ++ return err; ++ ++ /* FIXME: some callback to free underlying resource ++ * and correct ee_start? 
*/ ++ ext_debug(tree, "split extent: head %u:%u, tail %u:%u\n", ++ ex->ee_block, ex->ee_len, tex.ee_block, tex.ee_len); ++ ++ npath = ext3_ext_find_extent(tree, ex->ee_block, NULL); ++ if (IS_ERR(npath)) ++ return PTR_ERR(npath); ++ depth = EXT_DEPTH(tree); ++ EXT_ASSERT(npath[depth].p_ext->ee_block == ex->ee_block); ++ EXT_ASSERT(npath[depth].p_ext->ee_len == ex->ee_len); ++ ++ err = ext3_ext_insert_extent(handle, tree, npath, &tex); ++ ext3_ext_drop_refs(npath); ++ kfree(npath); ++ ++ return err; ++} ++ ++static int ++ext3_ext_rm_leaf(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, unsigned long start, ++ unsigned long end) ++{ ++ struct ext3_extent *ex, *fu = NULL, *lu, *le; ++ int err = 0, correct_index = 0; ++ int depth = EXT_DEPTH(tree), credits; ++ struct ext3_extent_header *eh; ++ unsigned a, b, block, num; ++ ++ ext_debug(tree, "remove [%lu:%lu] in leaf\n", start, end); ++ if (!path[depth].p_hdr) ++ path[depth].p_hdr = EXT_BLOCK_HDR(path[depth].p_bh); ++ eh = path[depth].p_hdr; ++ EXT_ASSERT(eh); ++ EXT_ASSERT(eh->eh_entries <= eh->eh_max); ++ EXT_ASSERT(eh->eh_magic == EXT3_EXT_MAGIC); ++ ++ /* find where to start removing */ ++ le = ex = EXT_LAST_EXTENT(eh); ++ while (ex != EXT_FIRST_EXTENT(eh)) { ++ if (ex->ee_block <= end) ++ break; ++ ex--; ++ } ++ ++ if (start > ex->ee_block && end < ex->ee_block + ex->ee_len - 1) { ++ /* removal of internal part of the extent requested ++ * tail and head must be placed in different extent ++ * so, we have to insert one more extent */ ++ path[depth].p_ext = ex; ++ return ext3_ext_split_for_rm(handle, tree, path, start, end); ++ } ++ ++ lu = ex; ++ while (ex >= EXT_FIRST_EXTENT(eh) && ex->ee_block + ex->ee_len > start) { ++ ext_debug(tree, "remove ext %u:%u\n", ex->ee_block, ex->ee_len); ++ path[depth].p_ext = ex; ++ ++ a = ex->ee_block > start ? ex->ee_block : start; ++ b = ex->ee_block + ex->ee_len - 1 < end ? 
++ ex->ee_block + ex->ee_len - 1 : end; ++ ++ ext_debug(tree, " border %u:%u\n", a, b); ++ ++ if (a != ex->ee_block && b != ex->ee_block + ex->ee_len - 1) { ++ block = 0; ++ num = 0; ++ BUG(); ++ } else if (a != ex->ee_block) { ++ /* remove tail of the extent */ ++ block = ex->ee_block; ++ num = a - block; ++ } else if (b != ex->ee_block + ex->ee_len - 1) { ++ /* remove head of the extent */ ++ block = a; ++ num = b - a; ++ } else { ++ /* remove whole extent: excellent! */ ++ block = ex->ee_block; ++ num = 0; ++ EXT_ASSERT(a == ex->ee_block && ++ b == ex->ee_block + ex->ee_len - 1); ++ } ++ ++ if (ex == EXT_FIRST_EXTENT(eh)) ++ correct_index = 1; ++ ++ credits = 1; ++ if (correct_index) ++ credits += (EXT_DEPTH(tree) * EXT3_ALLOC_NEEDED) + 1; ++ if (tree->ops->remove_extent_credits) ++ credits += tree->ops->remove_extent_credits(tree, ex, a, b); ++ ++ handle = ext3_ext_journal_restart(handle, credits); ++ if (IS_ERR(handle)) { ++ err = PTR_ERR(handle); ++ goto out; ++ } ++ ++ err = ext3_ext_get_access(handle, tree, path + depth); ++ if (err) ++ goto out; ++ ++ if (tree->ops->remove_extent) ++ err = tree->ops->remove_extent(tree, ex, a, b); ++ if (err) ++ goto out; ++ ++ if (num == 0) { ++ /* this extent is removed entirely, mark slot unused */ ++ ex->ee_start = 0; ++ eh->eh_entries--; ++ fu = ex; ++ } ++ ++ ex->ee_block = block; ++ ex->ee_len = num; ++ ++ err = ext3_ext_dirty(handle, tree, path + depth); ++ if (err) ++ goto out; ++ ++ ext_debug(tree, "new extent: %u:%u:%u\n", ++ ex->ee_block, ex->ee_len, ex->ee_start); ++ ex--; ++ } ++ ++ if (fu) { ++ /* reuse unused slots */ ++ while (lu < le) { ++ if (lu->ee_start) { ++ *fu = *lu; ++ lu->ee_start = 0; ++ fu++; ++ } ++ lu++; ++ } ++ } ++ ++ if (correct_index && eh->eh_entries) ++ err = ext3_ext_correct_indexes(handle, tree, path); ++ ++ /* if this leaf is free, then we should ++ * remove it from index block above */ ++ if (err == 0 && eh->eh_entries == 0 && path[depth].p_bh != NULL) ++ err = ext3_ext_rm_idx(handle, tree,
path + depth); ++ ++out: ++ return err; ++} ++ ++ ++static struct ext3_extent_idx * ++ext3_ext_last_covered(struct ext3_extent_header *hdr, unsigned long block) ++{ ++ struct ext3_extent_idx *ix; ++ ++ ix = EXT_LAST_INDEX(hdr); ++ while (ix != EXT_FIRST_INDEX(hdr)) { ++ if (ix->ei_block <= block) ++ break; ++ ix--; ++ } ++ return ix; ++} ++ ++/* ++ * returns 1 if current index has to be freed (even partial) ++ */ ++static inline int ++ext3_ext_more_to_rm(struct ext3_ext_path *path) ++{ ++ EXT_ASSERT(path->p_idx); ++ ++ if (path->p_idx < EXT_FIRST_INDEX(path->p_hdr)) ++ return 0; ++ ++ /* ++ * if truncate on a deeper level happened, it wasn't partial, ++ * so we have to consider the current index for truncation ++ */ ++ if (path->p_hdr->eh_entries == path->p_block) ++ return 0; ++ return 1; ++} ++ ++int ext3_ext_remove_space(struct ext3_extents_tree *tree, ++ unsigned long start, unsigned long end) ++{ ++ struct inode *inode = tree->inode; ++ struct super_block *sb = inode->i_sb; ++ int depth = EXT_DEPTH(tree); ++ struct ext3_ext_path *path; ++ handle_t *handle; ++ int i = 0, err = 0; ++ ++ ext_debug(tree, "space to be removed: %lu:%lu\n", start, end); ++ ++ /* probably first extent we're gonna free will be last in block */ ++ handle = ext3_journal_start(inode, depth + 1); ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ ++ ext3_ext_invalidate_cache(tree); ++ ++ /* ++ * we start scanning from the right side, freeing all the blocks ++ * after i_size and walking into the deep ++ */ ++ path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 1), GFP_KERNEL); ++ if (path == NULL) { ++ ext3_error(sb, __FUNCTION__, "Can't allocate path array"); ++ ext3_journal_stop(handle); ++ return -ENOMEM; ++ } ++ memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1)); ++ path[i].p_hdr = EXT_ROOT_HDR(tree); ++ ++ while (i >= 0 && err == 0) { ++ if (i == depth) { ++ /* this is leaf block */ ++ err = ext3_ext_rm_leaf(handle, tree, path, start, end); ++ /* root level has p_bh == NULL,
brelse() eats this */ ++ brelse(path[i].p_bh); ++ i--; ++ continue; ++ } ++ ++ /* this is index block */ ++ if (!path[i].p_hdr) { ++ ext_debug(tree, "initialize header\n"); ++ path[i].p_hdr = EXT_BLOCK_HDR(path[i].p_bh); ++ } ++ ++ EXT_ASSERT(path[i].p_hdr->eh_entries <= path[i].p_hdr->eh_max); ++ EXT_ASSERT(path[i].p_hdr->eh_magic == EXT3_EXT_MAGIC); ++ ++ if (!path[i].p_idx) { ++ /* this level hasn't been touched yet */ ++ path[i].p_idx = ++ ext3_ext_last_covered(path[i].p_hdr, end); ++ path[i].p_block = path[i].p_hdr->eh_entries + 1; ++ ext_debug(tree, "init index ptr: hdr 0x%p, num %d\n", ++ path[i].p_hdr, path[i].p_hdr->eh_entries); ++ } else { ++ /* we've already been here, look at the next index */ ++ path[i].p_idx--; ++ } ++ ++ ext_debug(tree, "level %d - index, first 0x%p, cur 0x%p\n", ++ i, EXT_FIRST_INDEX(path[i].p_hdr), ++ path[i].p_idx); ++ if (ext3_ext_more_to_rm(path + i)) { ++ /* go to the next level */ ++ ext_debug(tree, "move to level %d (block %d)\n", ++ i + 1, path[i].p_idx->ei_leaf); ++ memset(path + i + 1, 0, sizeof(*path)); ++ path[i+1].p_bh = sb_bread(sb, path[i].p_idx->ei_leaf); ++ if (!path[i+1].p_bh) { ++ /* should we reset i_size?
*/ ++ err = -EIO; ++ break; ++ } ++ /* store the actual number of indexes so we can tell ++ * at the next iteration whether it changed */ ++ path[i].p_block = path[i].p_hdr->eh_entries; ++ i++; ++ } else { ++ /* we've finished processing this index, go up */ ++ if (path[i].p_hdr->eh_entries == 0 && i > 0) { ++ /* index is empty, remove it; ++ * the handle must already be prepared by ++ * truncatei_leaf() */ ++ err = ext3_ext_rm_idx(handle, tree, path + i); ++ } ++ /* root level has p_bh == NULL, brelse() eats this */ ++ brelse(path[i].p_bh); ++ i--; ++ ext_debug(tree, "return to level %d\n", i); ++ } ++ } ++ ++ /* TODO: flexible tree reduction should be here */ ++ if (path->p_hdr->eh_entries == 0) { ++ /* ++ * truncate to zero freed the whole tree, ++ * so we need to correct eh_depth ++ */ ++ err = ext3_ext_get_access(handle, tree, path); ++ if (err == 0) { ++ EXT_ROOT_HDR(tree)->eh_depth = 0; ++ EXT_ROOT_HDR(tree)->eh_max = ext3_ext_space_root(tree); ++ err = ext3_ext_dirty(handle, tree, path); ++ } ++ } ++ ext3_ext_tree_changed(tree); ++ ++ kfree(path); ++ ext3_journal_stop(handle); ++ ++ return err; ++} ++ ++int ext3_ext_calc_metadata_amount(struct ext3_extents_tree *tree, int blocks) ++{ ++ int lcap, icap, rcap, leafs, idxs, num; ++ ++ rcap = ext3_ext_space_root(tree); ++ if (blocks <= rcap) { ++ /* all extents fit in the root */ ++ return 0; ++ } ++ ++ rcap = ext3_ext_space_root_idx(tree); ++ lcap = ext3_ext_space_block(tree); ++ icap = ext3_ext_space_block_idx(tree); ++ ++ num = leafs = (blocks + lcap - 1) / lcap; ++ if (leafs <= rcap) { ++ /* all pointers to leaves fit in the root */ ++ return leafs; ++ } ++ ++ /* ok.
we need separate index block(s) to link all leaf blocks */ ++ idxs = (leafs + icap - 1) / icap; ++ do { ++ num += idxs; ++ idxs = (idxs + icap - 1) / icap; ++ } while (idxs > rcap); ++ ++ return num; ++} ++ ++/* ++ * called at mount time ++ */ ++void ext3_ext_init(struct super_block *sb) ++{ ++ /* ++ * possible initialization would be here ++ */ ++ ++ if (test_opt(sb, EXTENTS)) { ++ printk("EXT3-fs: file extents enabled"); ++#ifdef AGRESSIVE_TEST ++ printk(", agressive tests"); ++#endif ++#ifdef CHECK_BINSEARCH ++ printk(", check binsearch"); ++#endif ++ printk("\n"); ++ } ++} ++ ++/* ++ * called at umount time ++ */ ++void ext3_ext_release(struct super_block *sb) ++{ ++} ++ ++/************************************************************************ ++ * VFS related routines ++ ************************************************************************/ ++ ++static int ext3_get_inode_write_access(handle_t *handle, void *buffer) ++{ ++ /* we use in-core data, not bh */ ++ return 0; ++} ++ ++static int ext3_mark_buffer_dirty(handle_t *handle, void *buffer) ++{ ++ struct inode *inode = buffer; ++ return ext3_mark_inode_dirty(handle, inode); ++} ++ ++static int ext3_ext_mergable(struct ext3_extent *ex1, ++ struct ext3_extent *ex2) ++{ ++ /* FIXME: support for large fs */ ++ if (ex1->ee_start + ex1->ee_len == ex2->ee_start) ++ return 1; ++ return 0; ++} ++ ++static int ++ext3_remove_blocks_credits(struct ext3_extents_tree *tree, ++ struct ext3_extent *ex, ++ unsigned long from, unsigned long to) ++{ ++ int needed; ++ ++ /* at present, extent can't cross block group */; ++ needed = 4; /* bitmap + group desc + sb + inode */ ++ ++#ifdef CONFIG_QUOTA ++ needed += 2 * EXT3_SINGLEDATA_TRANS_BLOCKS; ++#endif ++ return needed; ++} ++ ++static int ++ext3_remove_blocks(struct ext3_extents_tree *tree, ++ struct ext3_extent *ex, ++ unsigned long from, unsigned long to) ++{ ++ int needed = ext3_remove_blocks_credits(tree, ex, from, to); ++ handle_t *handle = 
ext3_journal_start(tree->inode, needed); ++ struct buffer_head *bh; ++ int i; ++ ++ if (IS_ERR(handle)) ++ return PTR_ERR(handle); ++ if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { ++ /* tail removal */ ++ unsigned long num, start; ++ num = ex->ee_block + ex->ee_len - from; ++ start = ex->ee_start + ex->ee_len - num; ++ ext_debug(tree, "free last %lu blocks starting %lu\n", ++ num, start); ++ for (i = 0; i < num; i++) { ++ bh = sb_find_get_block(tree->inode->i_sb, start + i); ++ ext3_forget(handle, 0, tree->inode, bh, start + i); ++ } ++ ext3_free_blocks(handle, tree->inode, start, num); ++ } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { ++ printk("strange request: removal %lu-%lu from %u:%u\n", ++ from, to, ex->ee_block, ex->ee_len); ++ } else { ++ printk("strange request: removal(2) %lu-%lu from %u:%u\n", ++ from, to, ex->ee_block, ex->ee_len); ++ } ++ ext3_journal_stop(handle); ++ return 0; ++} ++ ++static int ext3_ext_find_goal(struct inode *inode, ++ struct ext3_ext_path *path, unsigned long block) ++{ ++ struct ext3_inode_info *ei = EXT3_I(inode); ++ unsigned long bg_start; ++ unsigned long colour; ++ int depth; ++ ++ if (path) { ++ struct ext3_extent *ex; ++ depth = path->p_depth; ++ ++ /* try to predict block placement */ ++ if ((ex = path[depth].p_ext)) ++ return ex->ee_start + (block - ex->ee_block); ++ ++ /* it looks like the index is empty; ++ * try to find a goal starting from the index itself */ ++ if (path[depth].p_bh) ++ return path[depth].p_bh->b_blocknr; ++ } ++ ++ /* OK.
use inode's group */ ++ bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) + ++ le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block); ++ colour = (current->pid % 16) * ++ (EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16); ++ return bg_start + colour + block; ++} ++ ++static int ext3_new_block_cb(handle_t *handle, struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_extent *ex, int *err) ++{ ++ struct inode *inode = tree->inode; ++ int newblock, goal; ++ ++ EXT_ASSERT(path); ++ EXT_ASSERT(ex); ++ EXT_ASSERT(ex->ee_start); ++ EXT_ASSERT(ex->ee_len); ++ ++ /* reuse block from the extent to order data/metadata */ ++ newblock = ex->ee_start++; ++ ex->ee_len--; ++ if (ex->ee_len == 0) { ++ ex->ee_len = 1; ++ /* allocate new block for the extent */ ++ goal = ext3_ext_find_goal(inode, path, ex->ee_block); ++ ex->ee_start = ext3_new_block(handle, inode, goal, err); ++ if (ex->ee_start == 0) { ++ /* error occurred: restore old extent */ ++ ex->ee_start = newblock; ++ return 0; ++ } ++ } ++ return newblock; ++} ++ ++static struct ext3_extents_helpers ext3_blockmap_helpers = { ++ .get_write_access = ext3_get_inode_write_access, ++ .mark_buffer_dirty = ext3_mark_buffer_dirty, ++ .mergable = ext3_ext_mergable, ++ .new_block = ext3_new_block_cb, ++ .remove_extent = ext3_remove_blocks, ++ .remove_extent_credits = ext3_remove_blocks_credits, ++}; ++ ++void ext3_init_tree_desc(struct ext3_extents_tree *tree, ++ struct inode *inode) ++{ ++ tree->inode = inode; ++ tree->root = (void *) EXT3_I(inode)->i_data; ++ tree->buffer = (void *) inode; ++ tree->buffer_len = sizeof(EXT3_I(inode)->i_data); ++ tree->cex = (struct ext3_ext_cache *) &EXT3_I(inode)->i_cached_extent; ++ tree->ops = &ext3_blockmap_helpers; ++} ++ ++int ext3_ext_get_block(handle_t *handle, struct inode *inode, ++ long iblock, struct buffer_head *bh_result, ++ int create, int extend_disksize) ++{ ++ struct ext3_ext_path *path = NULL; ++ struct ext3_extent newex; ++ struct
ext3_extent *ex; ++ int goal, newblock, err = 0, depth; ++ struct ext3_extents_tree tree; ++ ++ clear_buffer_new(bh_result); ++ ext3_init_tree_desc(&tree, inode); ++ ext_debug(&tree, "block %d requested for inode %u\n", ++ (int) iblock, (unsigned) inode->i_ino); ++ mutex_lock(&EXT3_I(inode)->truncate_mutex); ++ ++ /* check in cache */ ++ if ((goal = ext3_ext_in_cache(&tree, iblock, &newex))) { ++ if (goal == EXT3_EXT_CACHE_GAP) { ++ if (!create) { ++ /* block isn't allocated yet and ++ * the user doesn't want to allocate it */ ++ goto out2; ++ } ++ /* we should allocate requested block */ ++ } else if (goal == EXT3_EXT_CACHE_EXTENT) { ++ /* block is already allocated */ ++ newblock = iblock - newex.ee_block + newex.ee_start; ++ goto out; ++ } else { ++ EXT_ASSERT(0); ++ } ++ } ++ ++ /* find extent for this block */ ++ path = ext3_ext_find_extent(&tree, iblock, NULL); ++ if (IS_ERR(path)) { ++ err = PTR_ERR(path); ++ path = NULL; ++ goto out2; ++ } ++ ++ depth = EXT_DEPTH(&tree); ++ ++ /* ++ * consistent leaf must not be empty ++ * this situation is possible, though, _during_ tree modification ++ * this is why assert can't be put in ext3_ext_find_extent() ++ */ ++ EXT_ASSERT(path[depth].p_ext != NULL || depth == 0); ++ ++ if ((ex = path[depth].p_ext)) { ++ /* if the found extent covers the block, simply return it */ ++ if (iblock >= ex->ee_block && iblock < ex->ee_block + ex->ee_len) { ++ newblock = iblock - ex->ee_block + ex->ee_start; ++ ext_debug(&tree, "%d fit into %d:%d -> %d\n", ++ (int) iblock, ex->ee_block, ex->ee_len, ++ newblock); ++ ext3_ext_put_in_cache(&tree, ex->ee_block, ++ ex->ee_len, ex->ee_start, ++ EXT3_EXT_CACHE_EXTENT); ++ goto out; ++ } ++ } ++ ++ /* ++ * requested block isn't allocated yet ++ * we can't create the block if the create flag is zero ++ */ ++ if (!create) { ++ /* put just found gap into cache to speed up subsequent requests */ ++ ext3_ext_put_gap_in_cache(&tree, path, iblock); ++ goto out2; ++ } ++ ++ /* allocate new block */ ++ goal =
ext3_ext_find_goal(inode, path, iblock); ++ newblock = ext3_new_block(handle, inode, goal, &err); ++ if (!newblock) ++ goto out2; ++ ext_debug(&tree, "allocate new block: goal %d, found %d\n", ++ goal, newblock); ++ ++ /* try to insert new extent into found leaf and return */ ++ newex.ee_block = iblock; ++ newex.ee_start = newblock; ++ newex.ee_len = 1; ++ err = ext3_ext_insert_extent(handle, &tree, path, &newex); ++ if (err) ++ goto out2; ++ ++ if (extend_disksize && inode->i_size > EXT3_I(inode)->i_disksize) ++ EXT3_I(inode)->i_disksize = inode->i_size; ++ ++ /* previous routine could use block we allocated */ ++ newblock = newex.ee_start; ++ set_buffer_new(bh_result); ++ ++ ext3_ext_put_in_cache(&tree, newex.ee_block, newex.ee_len, ++ newex.ee_start, EXT3_EXT_CACHE_EXTENT); ++out: ++ ext3_ext_show_leaf(&tree, path); ++ map_bh(bh_result, inode->i_sb, newblock); ++out2: ++ if (path) { ++ ext3_ext_drop_refs(path); ++ kfree(path); ++ } ++ mutex_unlock(&EXT3_I(inode)->truncate_mutex); ++ ++ return err; ++} ++ ++void ext3_ext_truncate(struct inode *inode, struct page *page) ++{ ++ struct address_space *mapping = inode->i_mapping; ++ struct super_block *sb = inode->i_sb; ++ struct ext3_extents_tree tree; ++ unsigned long last_block; ++ handle_t *handle; ++ int err = 0; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ++ /* ++ * probably first extent we're gonna free will be last in block ++ */ ++ err = ext3_writepage_trans_blocks(inode) + 3; ++ handle = ext3_journal_start(inode, err); ++ if (IS_ERR(handle)) { ++ if (page) { ++ clear_highpage(page); ++ flush_dcache_page(page); ++ unlock_page(page); ++ page_cache_release(page); ++ } ++ return; ++ } ++ ++ if (page) ++ ext3_block_truncate_page(handle, page, mapping, inode->i_size); ++ ++ mutex_lock(&EXT3_I(inode)->truncate_mutex); ++ ext3_ext_invalidate_cache(&tree); ++ ++ /* ++ * TODO: optimization is possible here ++ * we probably don't need scanning at all, ++ * because page truncation is enough ++ */ ++ if
(ext3_orphan_add(handle, inode)) ++ goto out_stop; ++ ++ /* we have to know where to truncate from in crash case */ ++ EXT3_I(inode)->i_disksize = inode->i_size; ++ ext3_mark_inode_dirty(handle, inode); ++ ++ last_block = (inode->i_size + sb->s_blocksize - 1) >> ++ EXT3_BLOCK_SIZE_BITS(sb); ++ err = ext3_ext_remove_space(&tree, last_block, EXT_MAX_BLOCK); ++ ++ /* In a multi-transaction truncate, we only make the final ++ * transaction synchronous */ ++ if (IS_SYNC(inode)) ++ handle->h_sync = 1; ++ ++out_stop: ++ /* ++ * If this was a simple ftruncate(), and the file will remain alive ++ * then we need to clear up the orphan record which we created above. ++ * However, if this was a real unlink then we were called by ++ * ext3_delete_inode(), and we allow that function to clean up the ++ * orphan info for us. ++ */ ++ if (inode->i_nlink) ++ ext3_orphan_del(handle, inode); ++ ++ mutex_unlock(&EXT3_I(inode)->truncate_mutex); ++ ext3_journal_stop(handle); ++} ++ ++/* ++ * this routine calculates the max number of blocks we could modify ++ * in order to allocate a new block for an inode ++ */ ++int ext3_ext_writepage_trans_blocks(struct inode *inode, int num) ++{ ++ struct ext3_extents_tree tree; ++ int needed; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ++ needed = ext3_ext_calc_credits_for_insert(&tree, NULL); ++ ++ /* caller wants to allocate num blocks */ ++ needed *= num; ++ ++#ifdef CONFIG_QUOTA ++ /* ++ * FIXME: real calculation should be here ++ * it depends on blockmap format of quota file ++ */ ++ needed += 2 * EXT3_SINGLEDATA_TRANS_BLOCKS; ++#endif ++ ++ return needed; ++} ++ ++void ext3_extents_initialize_blockmap(handle_t *handle, struct inode *inode) ++{ ++ struct ext3_extents_tree tree; ++ ++ ext3_init_tree_desc(&tree, inode); ++ ext3_extent_tree_init(handle, &tree); ++} ++ ++int ext3_ext_calc_blockmap_metadata(struct inode *inode, int blocks) ++{ ++ struct ext3_extents_tree tree; ++ ++ ext3_init_tree_desc(&tree, inode); ++ return
ext3_ext_calc_metadata_amount(&tree, blocks); ++} ++ ++static int ++ext3_ext_store_extent_cb(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_ext_cache *newex) ++{ ++ struct ext3_extent_buf *buf = (struct ext3_extent_buf *) tree->private; ++ ++ if (newex->ec_type != EXT3_EXT_CACHE_EXTENT) ++ return EXT_CONTINUE; ++ ++ if (buf->err < 0) ++ return EXT_BREAK; ++ if (buf->cur - buf->buffer + sizeof(*newex) > buf->buflen) ++ return EXT_BREAK; ++ ++ if (!copy_to_user(buf->cur, newex, sizeof(*newex))) { ++ buf->err++; ++ buf->cur += sizeof(*newex); ++ } else { ++ buf->err = -EFAULT; ++ return EXT_BREAK; ++ } ++ return EXT_CONTINUE; ++} ++ ++static int ++ext3_ext_collect_stats_cb(struct ext3_extents_tree *tree, ++ struct ext3_ext_path *path, ++ struct ext3_ext_cache *ex) ++{ ++ struct ext3_extent_tree_stats *buf = ++ (struct ext3_extent_tree_stats *) tree->private; ++ int depth; ++ ++ if (ex->ec_type != EXT3_EXT_CACHE_EXTENT) ++ return EXT_CONTINUE; ++ ++ depth = EXT_DEPTH(tree); ++ buf->extents_num++; ++ if (path[depth].p_ext == EXT_FIRST_EXTENT(path[depth].p_hdr)) ++ buf->leaf_num++; ++ return EXT_CONTINUE; ++} ++ ++int ext3_ext_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, ++ unsigned long arg) ++{ ++ int err = 0; ++ ++ if (!(EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)) ++ return -EINVAL; ++ ++ if (cmd == EXT3_IOC_GET_EXTENTS) { ++ struct ext3_extent_buf buf; ++ struct ext3_extents_tree tree; ++ ++ if (copy_from_user(&buf, (void *) arg, sizeof(buf))) ++ return -EFAULT; ++ ++ ext3_init_tree_desc(&tree, inode); ++ buf.cur = buf.buffer; ++ buf.err = 0; ++ tree.private = &buf; ++ mutex_lock(&EXT3_I(inode)->truncate_mutex); ++ err = ext3_ext_walk_space(&tree, buf.start, EXT_MAX_BLOCK, ++ ext3_ext_store_extent_cb); ++ mutex_unlock(&EXT3_I(inode)->truncate_mutex); ++ if (err == 0) ++ err = buf.err; ++ } else if (cmd == EXT3_IOC_GET_TREE_STATS) { ++ struct ext3_extent_tree_stats buf; ++ struct ext3_extents_tree tree; ++ ++ 
ext3_init_tree_desc(&tree, inode); ++ mutex_lock(&EXT3_I(inode)->truncate_mutex); ++ buf.depth = EXT_DEPTH(&tree); ++ buf.extents_num = 0; ++ buf.leaf_num = 0; ++ tree.private = &buf; ++ err = ext3_ext_walk_space(&tree, 0, EXT_MAX_BLOCK, ++ ext3_ext_collect_stats_cb); ++ mutex_unlock(&EXT3_I(inode)->truncate_mutex); ++ if (!err) ++ err = copy_to_user((void *) arg, &buf, sizeof(buf)); ++ } else if (cmd == EXT3_IOC_GET_TREE_DEPTH) { ++ struct ext3_extents_tree tree; ++ ext3_init_tree_desc(&tree, inode); ++ mutex_lock(&EXT3_I(inode)->truncate_mutex); ++ err = EXT_DEPTH(&tree); ++ mutex_unlock(&EXT3_I(inode)->truncate_mutex); ++ } ++ ++ return err; ++} ++ ++EXPORT_SYMBOL(ext3_init_tree_desc); ++EXPORT_SYMBOL(ext3_mark_inode_dirty); ++EXPORT_SYMBOL(ext3_ext_invalidate_cache); ++EXPORT_SYMBOL(ext3_ext_insert_extent); ++EXPORT_SYMBOL(ext3_ext_walk_space); ++EXPORT_SYMBOL(ext3_ext_find_goal); ++EXPORT_SYMBOL(ext3_ext_calc_credits_for_insert); +Index: linux-stage/fs/ext3/ialloc.c +=================================================================== +--- linux-stage.orig/fs/ext3/ialloc.c 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/fs/ext3/ialloc.c 2006-07-16 14:10:20.000000000 +0800 +@@ -600,7 +600,7 @@ got: + ei->i_dir_start_lookup = 0; + ei->i_disksize = 0; + +- ei->i_flags = EXT3_I(dir)->i_flags & ~EXT3_INDEX_FL; ++ ei->i_flags = EXT3_I(dir)->i_flags & ~(EXT3_INDEX_FL|EXT3_EXTENTS_FL); + if (S_ISLNK(mode)) + ei->i_flags &= ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL); + /* dirsync only applies to directories */ +@@ -644,6 +644,18 @@ got: + if (err) + goto fail_free_drop; + ++ if (test_opt(sb, EXTENTS) && S_ISREG(inode->i_mode)) { ++ EXT3_I(inode)->i_flags |= EXT3_EXTENTS_FL; ++ ext3_extents_initialize_blockmap(handle, inode); ++ if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS)) { ++ err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh); ++ if (err) goto fail; ++ EXT3_SET_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS); ++ 
BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "call ext3_journal_dirty_metadata"); ++ err = ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); ++ } ++ } ++ + err = ext3_mark_inode_dirty(handle, inode); + if (err) { + ext3_std_error(sb, err); +Index: linux-stage/fs/ext3/inode.c +=================================================================== +--- linux-stage.orig/fs/ext3/inode.c 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/fs/ext3/inode.c 2006-07-16 14:11:28.000000000 +0800 +@@ -40,7 +40,7 @@ + #include "iopen.h" + #include "acl.h" + +-static int ext3_writepage_trans_blocks(struct inode *inode); ++int ext3_writepage_trans_blocks(struct inode *inode); + + /* + * Test whether an inode is a fast symlink. +@@ -944,6 +944,17 @@ out: + + #define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32) + ++static inline int ++ext3_get_block_wrap(handle_t *handle, struct inode *inode, long block, ++ struct buffer_head *bh, int create, int extend_disksize) ++{ ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_get_block(handle, inode, block, bh, create, ++ extend_disksize); ++ return ext3_get_blocks_handle(handle, inode, block, 1, bh, create, ++ extend_disksize); ++} ++ + static int ext3_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) + { +@@ -984,8 +995,8 @@ static int ext3_get_block(struct inode * + + get_block: + if (ret == 0) { +- ret = ext3_get_blocks_handle(handle, inode, iblock, +- max_blocks, bh_result, create, 0); ++ ret = ext3_get_block_wrap(handle, inode, iblock, ++ bh_result, create, 0); + if (ret > 0) { + bh_result->b_size = (ret << inode->i_blkbits); + ret = 0; +@@ -1008,7 +1019,7 @@ struct buffer_head *ext3_getblk(handle_t + dummy.b_state = 0; + dummy.b_blocknr = -1000; + buffer_trace_init(&dummy.b_history); +- err = ext3_get_blocks_handle(handle, inode, block, 1, ++ err = ext3_get_block_wrap(handle, inode, block, + &dummy, create, 1); + if (err == 1) { + err = 0; +@@ -1756,7 +1767,7 @@ void 
ext3_set_aops(struct inode *inode) + * This required during truncate. We need to physically zero the tail end + * of that block so it doesn't yield old data if the file is later grown. + */ +-static int ext3_block_truncate_page(handle_t *handle, struct page *page, ++int ext3_block_truncate_page(handle_t *handle, struct page *page, + struct address_space *mapping, loff_t from) + { + ext3_fsblk_t index = from >> PAGE_CACHE_SHIFT; +@@ -2260,6 +2271,9 @@ void ext3_truncate(struct inode *inode) + return; + } + ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_truncate(inode, page); ++ + handle = start_transaction(inode); + if (IS_ERR(handle)) { + if (page) { +@@ -3004,12 +3018,15 @@ err_out: + * block and work out the exact number of indirects which are touched. Pah. + */ + +-static int ext3_writepage_trans_blocks(struct inode *inode) ++int ext3_writepage_trans_blocks(struct inode *inode) + { + int bpp = ext3_journal_blocks_per_page(inode); + int indirects = (EXT3_NDIR_BLOCKS % bpp) ? 
5 : 3; + int ret; + ++ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL) ++ return ext3_ext_writepage_trans_blocks(inode, bpp); ++ + if (ext3_should_journal_data(inode)) + ret = 3 * (bpp + indirects) + 2; + else +@@ -3277,7 +3294,7 @@ int ext3_prep_san_write(struct inode *in + + /* alloc blocks one by one */ + for (i = 0; i < nblocks; i++) { +- ret = ext3_get_block_handle(handle, inode, blocks[i], ++ ret = ext3_get_blocks_handle(handle, inode, blocks[i], 1, + &bh_tmp, 1, 1); + if (ret) + break; +@@ -3337,7 +3354,7 @@ int ext3_map_inode_page(struct inode *in + if (blocks[i] != 0) + continue; + +- rc = ext3_get_block_handle(handle, inode, iblock, &dummy, 1, 1); ++ rc = ext3_get_blocks_handle(handle, inode, iblock, 1, &dummy, 1, 1); + if (rc) { + printk(KERN_INFO "ext3_map_inode_page: error reading " + "block %ld\n", iblock); +Index: linux-stage/fs/ext3/Makefile +=================================================================== +--- linux-stage.orig/fs/ext3/Makefile 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/fs/ext3/Makefile 2006-07-16 14:10:21.000000000 +0800 +@@ -5,7 +5,8 @@ + obj-$(CONFIG_EXT3_FS) += ext3.o + + ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ +- ioctl.o namei.o super.o symlink.o hash.o resize.o ++ ioctl.o namei.o super.o symlink.o hash.o resize.o \ ++ extents.o + + ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o + ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o +Index: linux-stage/fs/ext3/super.c +=================================================================== +--- linux-stage.orig/fs/ext3/super.c 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/fs/ext3/super.c 2006-07-16 14:10:21.000000000 +0800 +@@ -391,6 +391,7 @@ static void ext3_put_super (struct super + struct ext3_super_block *es = sbi->s_es; + int i; + ++ ext3_ext_release(sb); + ext3_xattr_put_super(sb); + journal_destroy(sbi->s_journal); + if (!(sb->s_flags & MS_RDONLY)) { +@@ -455,6 +456,8 @@ static struct inode 
*ext3_alloc_inode(st + #endif + ei->i_block_alloc_info = NULL; + ei->vfs_inode.i_version = 1; ++ ++ memset(&ei->i_cached_extent, 0, sizeof(ei->i_cached_extent)); + return &ei->vfs_inode; + } + +@@ -638,6 +641,7 @@ enum { + Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, + Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, + Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, ++ Opt_extents, Opt_extdebug, + Opt_grpquota + }; + +@@ -690,6 +694,8 @@ static match_table_t tokens = { + {Opt_iopen, "iopen"}, + {Opt_noiopen, "noiopen"}, + {Opt_iopen_nopriv, "iopen_nopriv"}, ++ {Opt_extents, "extents"}, ++ {Opt_extdebug, "extdebug"}, + {Opt_barrier, "barrier=%u"}, + {Opt_err, NULL}, + {Opt_resize, "resize"}, +@@ -1035,6 +1041,12 @@ clear_qf_name: + case Opt_bh: + clear_opt(sbi->s_mount_opt, NOBH); + break; ++ case Opt_extents: ++ set_opt (sbi->s_mount_opt, EXTENTS); ++ break; ++ case Opt_extdebug: ++ set_opt (sbi->s_mount_opt, EXTDEBUG); ++ break; + default: + printk (KERN_ERR + "EXT3-fs: Unrecognized mount option \"%s\" " +@@ -1760,6 +1772,7 @@ static int ext3_fill_super (struct super + test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? 
"ordered": + "writeback"); + ++ ext3_ext_init(sb); + lock_kernel(); + return 0; + +Index: linux-stage/fs/ext3/ioctl.c +=================================================================== +--- linux-stage.orig/fs/ext3/ioctl.c 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/fs/ext3/ioctl.c 2006-07-16 13:55:31.000000000 +0800 +@@ -135,6 +135,10 @@ flags_err: + mutex_unlock(&inode->i_mutex); + return err; + } ++ case EXT3_IOC_GET_EXTENTS: ++ case EXT3_IOC_GET_TREE_STATS: ++ case EXT3_IOC_GET_TREE_DEPTH: ++ return ext3_ext_ioctl(inode, filp, cmd, arg); + case EXT3_IOC_GETVERSION: + case EXT3_IOC_GETVERSION_OLD: + return put_user(inode->i_generation, (int __user *) arg); +Index: linux-stage/include/linux/ext3_fs.h +=================================================================== +--- linux-stage.orig/include/linux/ext3_fs.h 2006-07-16 13:55:31.000000000 +0800 ++++ linux-stage/include/linux/ext3_fs.h 2006-07-16 14:10:21.000000000 +0800 +@@ -181,9 +181,10 @@ struct ext3_group_desc + #define EXT3_NOTAIL_FL 0x00008000 /* file tail should not be merged */ + #define EXT3_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */ + #define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/ ++#define EXT3_EXTENTS_FL 0x00080000 /* Inode uses extents */ + #define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */ + +-#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */ ++#define EXT3_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */ + #define EXT3_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */ + + /* +@@ -233,6 +234,9 @@ struct ext3_new_group_data { + #endif + #define EXT3_IOC_GETRSVSZ _IOR('f', 5, long) + #define EXT3_IOC_SETRSVSZ _IOW('f', 6, long) ++#define EXT3_IOC_GET_EXTENTS _IOR('f', 7, long) ++#define EXT3_IOC_GET_TREE_DEPTH _IOR('f', 8, long) ++#define EXT3_IOC_GET_TREE_STATS _IOR('f', 9, long) + + /* + * Mount options +@@ -373,6 +377,8 @@ struct ext3_inode { + #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota 
*/ + #define EXT3_MOUNT_IOPEN 0x400000 /* Allow access via iopen */ + #define EXT3_MOUNT_IOPEN_NOPRIV 0x800000/* Make iopen world-readable */ ++#define EXT3_MOUNT_EXTENTS 0x1000000/* Extents support */ ++#define EXT3_MOUNT_EXTDEBUG 0x2000000/* Extents debug */ + + /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ + #ifndef clear_opt +@@ -563,11 +569,13 @@ static inline struct ext3_inode_info *EX + #define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004 /* Needs recovery */ + #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */ + #define EXT3_FEATURE_INCOMPAT_META_BG 0x0010 ++#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */ + + #define EXT3_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR + #define EXT3_FEATURE_INCOMPAT_SUPP (EXT3_FEATURE_INCOMPAT_FILETYPE| \ + EXT3_FEATURE_INCOMPAT_RECOVER| \ +- EXT3_FEATURE_INCOMPAT_META_BG) ++ EXT3_FEATURE_INCOMPAT_META_BG| \ ++ EXT3_FEATURE_INCOMPAT_EXTENTS) + #define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \ + EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \ + EXT3_FEATURE_RO_COMPAT_BTREE_DIR) +@@ -787,6 +795,8 @@ extern unsigned long ext3_count_free (st + + + /* inode.c */ ++extern int ext3_block_truncate_page(handle_t *, struct page *, ++ struct address_space *, loff_t); + int ext3_forget(handle_t *handle, int is_metadata, struct inode *inode, + struct buffer_head *bh, ext3_fsblk_t blocknr); + struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); +@@ -860,6 +870,16 @@ extern struct inode_operations ext3_spec + extern struct inode_operations ext3_symlink_inode_operations; + extern struct inode_operations ext3_fast_symlink_inode_operations; + ++/* extents.c */ ++extern int ext3_ext_writepage_trans_blocks(struct inode *, int); ++extern int ext3_ext_get_block(handle_t *, struct inode *, long, ++ struct buffer_head *, int, int); ++extern void ext3_ext_truncate(struct inode *, struct page *); ++extern void ext3_ext_init(struct super_block 
*); ++extern void ext3_ext_release(struct super_block *); ++extern void ext3_extents_initialize_blockmap(handle_t *, struct inode *); ++extern int ext3_ext_ioctl(struct inode *inode, struct file *filp, ++ unsigned int cmd, unsigned long arg); + + #endif /* __KERNEL__ */ + +Index: linux-stage/include/linux/ext3_extents.h +=================================================================== +--- /dev/null 1970-01-01 00:00:00.000000000 +0000 ++++ linux-stage/include/linux/ext3_extents.h 2006-07-16 13:55:31.000000000 +0800 +@@ -0,0 +1,264 @@ ++/* ++ * Copyright (c) 2003, Cluster File Systems, Inc, info@clusterfs.com ++ * Written by Alex Tomas ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License version 2 as ++ * published by the Free Software Foundation. ++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. ++ * ++ * You should have received a copy of the GNU General Public License ++ * along with this program; if not, write to the Free Software ++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- ++ */ ++ ++#ifndef _LINUX_EXT3_EXTENTS ++#define _LINUX_EXT3_EXTENTS ++ ++/* ++ * with AGRESSIVE_TEST defined, the capacity of index/leaf blocks ++ * becomes very small, so index splits, in-depth growing and ++ * other hard changes happen much more often ++ * this is for debug purposes only ++ */ ++#define AGRESSIVE_TEST_ ++ ++/* ++ * if CHECK_BINSEARCH is defined, then the results of binary search ++ * will be checked by linear search ++ */ ++#define CHECK_BINSEARCH_ ++ ++/* ++ * if EXT_DEBUG is defined you can use the 'extdebug' mount option ++ * to get lots of info about what's going on ++ */ ++#define EXT_DEBUG_ ++#ifdef EXT_DEBUG ++#define ext_debug(tree,fmt,a...)
\ ++do { \ ++ if (test_opt((tree)->inode->i_sb, EXTDEBUG)) \ ++ printk(fmt, ##a); \ ++} while (0); ++#else ++#define ext_debug(tree,fmt,a...) ++#endif ++ ++/* ++ * if EXT_STATS is defined then stats numbers are collected ++ * these numbers will be displayed at umount time ++ */ ++#define EXT_STATS_ ++ ++ ++#define EXT3_ALLOC_NEEDED 3 /* block bitmap + group desc. + sb */ ++ ++/* ++ * ext3_inode has i_block array (total 60 bytes) ++ * first 4 bytes are used to store: ++ * - tree depth (0 means there is no tree yet; all extents are in the inode) ++ * - number of alive extents in the inode ++ */ ++ ++/* ++ * this is the extent on-disk structure ++ * it's used at the bottom of the tree ++ */ ++struct ext3_extent { ++ __u32 ee_block; /* first logical block extent covers */ ++ __u16 ee_len; /* number of blocks covered by extent */ ++ __u16 ee_start_hi; /* high 16 bits of physical block */ ++ __u32 ee_start; /* low 32 bits of physical block */ ++}; ++ ++/* ++ * this is the index on-disk structure ++ * it's used at all the levels, but the bottom ++ */ ++struct ext3_extent_idx { ++ __u32 ei_block; /* index covers logical blocks from 'block' */ ++ __u32 ei_leaf; /* pointer to the physical block of the next * ++ * level. leaf or next index could be here */ ++ __u16 ei_leaf_hi; /* high 16 bits of physical block */ ++ __u16 ei_unused; ++}; ++ ++/* ++ * each block (leaves and indexes), even an inode-stored one, has a header ++ */ ++struct ext3_extent_header { ++ __u16 eh_magic; /* probably will support different formats */ ++ __u16 eh_entries; /* number of valid entries */ ++ __u16 eh_max; /* capacity of store in entries */ ++ __u16 eh_depth; /* does the tree have real underlying blocks?
*/ ++ __u32 eh_generation; /* generation of the tree */ ++}; ++ ++#define EXT3_EXT_MAGIC 0xf30a ++ ++/* ++ * array of ext3_ext_path contains path to some extent ++ * creation/lookup routines use it for traversal/splitting/etc ++ * truncate uses it to simulate recursive walking ++ */ ++struct ext3_ext_path { ++ __u32 p_block; ++ __u16 p_depth; ++ struct ext3_extent *p_ext; ++ struct ext3_extent_idx *p_idx; ++ struct ext3_extent_header *p_hdr; ++ struct buffer_head *p_bh; ++}; ++ ++/* ++ * structure for external API ++ */ ++ ++/* ++ * storage for cached extent ++ */ ++struct ext3_ext_cache { ++ __u32 ec_start; ++ __u32 ec_block; ++ __u32 ec_len; ++ __u32 ec_type; ++}; ++ ++#define EXT3_EXT_CACHE_NO 0 ++#define EXT3_EXT_CACHE_GAP 1 ++#define EXT3_EXT_CACHE_EXTENT 2 ++ ++/* ++ * ext3_extents_tree is used to pass initial information ++ * to top-level extents API ++ */ ++struct ext3_extents_helpers; ++struct ext3_extents_tree { ++ struct inode *inode; /* inode which tree belongs to */ ++ void *root; /* ptr to data top of tree resides at */ ++ void *buffer; /* will be passed as arg to ^^ routines */ ++ int buffer_len; ++ void *private; ++ struct ext3_ext_cache *cex;/* last found extent */ ++ struct ext3_extents_helpers *ops; ++}; ++ ++struct ext3_extents_helpers { ++ int (*get_write_access)(handle_t *h, void *buffer); ++ int (*mark_buffer_dirty)(handle_t *h, void *buffer); ++ int (*mergable)(struct ext3_extent *ex1, struct ext3_extent *ex2); ++ int (*remove_extent_credits)(struct ext3_extents_tree *, ++ struct ext3_extent *, unsigned long, ++ unsigned long); ++ int (*remove_extent)(struct ext3_extents_tree *, ++ struct ext3_extent *, unsigned long, ++ unsigned long); ++ int (*new_block)(handle_t *, struct ext3_extents_tree *, ++ struct ext3_ext_path *, struct ext3_extent *, ++ int *); ++}; ++ ++/* ++ * to be called by ext3_ext_walk_space() ++ * negative retcode - error ++ * positive retcode - signal for ext3_ext_walk_space(), see below ++ * callback must return valid 
extent (passed or newly created) ++ */ ++typedef int (*ext_prepare_callback)(struct ext3_extents_tree *, ++ struct ext3_ext_path *, ++ struct ext3_ext_cache *); ++ ++#define EXT_CONTINUE 0 ++#define EXT_BREAK 1 ++#define EXT_REPEAT 2 ++ ++ ++#define EXT_MAX_BLOCK 0xffffffff ++ ++ ++#define EXT_FIRST_EXTENT(__hdr__) \ ++ ((struct ext3_extent *) (((char *) (__hdr__)) + \ ++ sizeof(struct ext3_extent_header))) ++#define EXT_FIRST_INDEX(__hdr__) \ ++ ((struct ext3_extent_idx *) (((char *) (__hdr__)) + \ ++ sizeof(struct ext3_extent_header))) ++#define EXT_HAS_FREE_INDEX(__path__) \ ++ ((__path__)->p_hdr->eh_entries < (__path__)->p_hdr->eh_max) ++#define EXT_LAST_EXTENT(__hdr__) \ ++ (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_entries - 1) ++#define EXT_LAST_INDEX(__hdr__) \ ++ (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_entries - 1) ++#define EXT_MAX_EXTENT(__hdr__) \ ++ (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_max - 1) ++#define EXT_MAX_INDEX(__hdr__) \ ++ (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_max - 1) ++ ++#define EXT_ROOT_HDR(tree) \ ++ ((struct ext3_extent_header *) (tree)->root) ++#define EXT_BLOCK_HDR(bh) \ ++ ((struct ext3_extent_header *) (bh)->b_data) ++#define EXT_DEPTH(_t_) \ ++ (((struct ext3_extent_header *)((_t_)->root))->eh_depth) ++#define EXT_GENERATION(_t_) \ ++ (((struct ext3_extent_header *)((_t_)->root))->eh_generation) ++ ++ ++#define EXT_ASSERT(__x__) if (!(__x__)) BUG(); ++ ++#define EXT_CHECK_PATH(tree,path) \ ++{ \ ++ int depth = EXT_DEPTH(tree); \ ++ BUG_ON((unsigned long) (path) < __PAGE_OFFSET); \ ++ BUG_ON((unsigned long) (path)[depth].p_idx < \ ++ __PAGE_OFFSET && (path)[depth].p_idx != NULL); \ ++ BUG_ON((unsigned long) (path)[depth].p_ext < \ ++ __PAGE_OFFSET && (path)[depth].p_ext != NULL); \ ++ BUG_ON((unsigned long) (path)[depth].p_hdr < __PAGE_OFFSET); \ ++ BUG_ON((unsigned long) (path)[depth].p_bh < __PAGE_OFFSET \ ++ && depth != 0); \ ++ BUG_ON((path)[0].p_depth != depth); \ ++} ++ ++ ++/* ++ * this structure is used to 
gather extents from the tree via ioctl ++ */ ++struct ext3_extent_buf { ++ unsigned long start; ++ int buflen; ++ void *buffer; ++ void *cur; ++ int err; ++}; ++ ++/* ++ * this structure is used to collect stats info about the tree ++ */ ++struct ext3_extent_tree_stats { ++ int depth; ++ int extents_num; ++ int leaf_num; ++}; ++ ++extern void ext3_init_tree_desc(struct ext3_extents_tree *, struct inode *); ++extern int ext3_extent_tree_init(handle_t *, struct ext3_extents_tree *); ++extern int ext3_ext_calc_credits_for_insert(struct ext3_extents_tree *, struct ext3_ext_path *); ++extern int ext3_ext_insert_extent(handle_t *, struct ext3_extents_tree *, struct ext3_ext_path *, struct ext3_extent *); ++extern int ext3_ext_walk_space(struct ext3_extents_tree *, unsigned long, unsigned long, ext_prepare_callback); ++extern int ext3_ext_remove_space(struct ext3_extents_tree *, unsigned long, unsigned long); ++extern struct ext3_ext_path * ext3_ext_find_extent(struct ext3_extents_tree *, int, struct ext3_ext_path *); ++extern int ext3_ext_calc_blockmap_metadata(struct inode *, int); ++ ++static inline void ++ext3_ext_invalidate_cache(struct ext3_extents_tree *tree) ++{ ++ if (tree->cex) ++ tree->cex->ec_type = EXT3_EXT_CACHE_NO; ++} ++ ++ ++#endif /* _LINUX_EXT3_EXTENTS */ +Index: linux-stage/include/linux/ext3_fs_i.h +=================================================================== +--- linux-stage.orig/include/linux/ext3_fs_i.h 2006-07-16 13:55:30.000000000 +0800 ++++ linux-stage/include/linux/ext3_fs_i.h 2006-07-16 14:10:20.000000000 +0800 +@@ -142,6 +142,8 @@ struct ext3_inode_info { + */ + struct mutex truncate_mutex; + struct inode vfs_inode; ++ ++ __u32 i_cached_extent[4]; + }; + + #endif /* _LINUX_EXT3_FS_I */ diff --git a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.5.patch b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.5.patch index 9e78214..b6c37c1 100644 --- a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.5.patch +++ 
b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.5.patch @@ -3,7 +3,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c =================================================================== --- linux-2.6.5-sles9.orig/fs/ext3/extents.c 2005-02-17 22:07:57.023609040 +0300 +++ linux-2.6.5-sles9/fs/ext3/extents.c 2005-02-23 01:02:37.396435640 +0300 -@@ -0,0 +1,2355 @@ +@@ -0,0 +1,2361 @@ +/* + * Copyright(c) 2003, 2004, 2005, Cluster File Systems, Inc, info@clusterfs.com + * Written by Alex Tomas @@ -179,7 +179,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c +{ + struct ext3_extent_header *neh = EXT_ROOT_HDR(tree); + neh->eh_generation = ((EXT_FLAGS(neh) & ~EXT_FLAGS_CLR_UNKNOWN) << 24) | -+ (EXT_GENERATION(neh) + 1); ++ (EXT_HDR_GEN(neh) + 1); +} + +static inline int ext3_ext_space_block(struct ext3_extents_tree *tree) @@ -561,6 +561,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + + ix->ei_block = logical; + ix->ei_leaf = ptr; ++ ix->ei_leaf_hi = ix->ei_unused = 0; + curp->p_hdr->eh_entries++; + + EXT_ASSERT(curp->p_hdr->eh_entries <= curp->p_hdr->eh_max); @@ -723,6 +724,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + fidx = EXT_FIRST_INDEX(neh); + fidx->ei_block = border; + fidx->ei_leaf = oldblock; ++ fidx->ei_leaf_hi = fidx->ei_unused = 0; + + ext_debug(tree, "int.index at %d (block %lu): %lu -> %lu\n", + i, newblock, border, oldblock); @@ -856,6 +858,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + /* FIXME: it works, but actually path[0] can be index */ + curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; + curp->p_idx->ei_leaf = newblock; ++ curp->p_idx->ei_leaf_hi = curp->p_idx->ei_unused = 0; + + neh = EXT_ROOT_HDR(tree); + fidx = EXT_FIRST_INDEX(neh); @@ -1404,6 +1407,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) { + ex->ee_block = cex->ec_block; + ex->ee_start = cex->ec_start; ++ ex->ee_start_hi = 0; + ex->ee_len = cex->ec_len; + ext_debug(tree, "%lu cached by 
%lu:%lu:%lu\n", + (unsigned long) block, @@ -1625,7 +1629,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + + if (num == 0) { + /* this extent is removed entirely mark slot unused */ -+ ex->ee_start = 0; ++ ex->ee_start = ex->ee_start_hi = 0; + eh->eh_entries--; + fu = ex; + } @@ -1647,7 +1651,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + while (lu < le) { + if (lu->ee_start) { + *fu = *lu; -+ lu->ee_start = 0; ++ lu->ee_start = lu->ee_start_hi = 0; + fu++; + } + lu++; @@ -2002,6 +2006,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + /* allocate new block for the extent */ + goal = ext3_ext_find_goal(inode, path, ex->ee_block); + ex->ee_start = ext3_new_block(handle, inode, goal, err); ++ ex->ee_start_hi = 0; + if (ex->ee_start == 0) { + /* error occured: restore old extent */ + ex->ee_start = newblock; @@ -2117,6 +2122,7 @@ Index: linux-2.6.5-sles9/fs/ext3/extents.c + /* try to insert new extent into found leaf and return */ + newex.ee_block = iblock; + newex.ee_start = newblock; ++ newex.ee_start_hi = 0; + newex.ee_len = 1; + err = ext3_ext_insert_extent(handle, &tree, path, &newex); + if (err) @@ -2512,26 +2518,30 @@ Index: linux-2.6.5-sles9/fs/ext3/super.c Opt_ignore, Opt_barrier, Opt_err, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -+ Opt_extents, Opt_extdebug, ++ Opt_extents, Opt_noextents, Opt_extdebug, }; static match_table_t tokens = { -@@ -582,6 +585,8 @@ +@@ -582,6 +585,9 @@ {Opt_iopen, "iopen"}, {Opt_noiopen, "noiopen"}, {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_extents, "extents"}, ++ {Opt_noextents, "noextents"}, + {Opt_extdebug, "extdebug"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL} }; -@@ -797,6 +802,12 @@ +@@ -797,6 +802,15 @@ break; case Opt_ignore: break; + case Opt_extents: + set_opt (sbi->s_mount_opt, EXTENTS); + break; ++ case Opt_noextents: ++ clear_opt (sbi->s_mount_opt, EXTENTS); ++ break; + case Opt_extdebug: + set_opt (sbi->s_mount_opt, EXTDEBUG); + break; @@ -2611,11 +2621,13 @@ Index: linux-2.6.5-sles9/include/linux/ext3_fs.h 
#define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \ EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \ EXT3_FEATURE_RO_COMPAT_BTREE_DIR) -@@ -729,6 +735,7 @@ +@@ -729,6 +735,9 @@ /* inode.c */ -+extern int ext3_block_truncate_page(handle_t *, struct page *, struct address_space *, loff_t); ++extern int ext3_block_truncate_page(handle_t *, struct page *, ++ struct address_space *, loff_t); ++extern int ext3_writepage_trans_blocks(struct inode *inode); extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int); extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); @@ -2839,14 +2851,14 @@ Index: linux-2.6.5-sles9/include/linux/ext3_extents.h + (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_max - 1) +#define EXT_MAX_INDEX(__hdr__) \ + (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_max - 1) -+#define EXT_GENERATION(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) ++#define EXT_HDR_GEN(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) +#define EXT_FLAGS(__hdr__) ((__hdr__)->eh_generation >> 24) +#define EXT_FLAGS_CLR_UNKNOWN 0x7 /* Flags cleared on modification */ + +#define EXT_BLOCK_HDR(__bh__) ((struct ext3_extent_header *)(__bh__)->b_data) +#define EXT_ROOT_HDR(__tree__) ((struct ext3_extent_header *)(__tree__)->root) +#define EXT_DEPTH(__tree__) (EXT_ROOT_HDR(__tree__)->eh_depth) -+ ++#define EXT_GENERATION(__tree__) EXT_HDR_GEN(EXT_ROOT_HDR(__tree__)) + +#define EXT_ASSERT(__x__) if (!(__x__)) BUG(); + diff --git a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.9-rhel4.patch b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.9-rhel4.patch index bd95c54..5b5558c 100644 --- a/ldiskfs/kernel_patches/patches/ext3-extents-2.6.9-rhel4.patch +++ b/ldiskfs/kernel_patches/patches/ext3-extents-2.6.9-rhel4.patch @@ -2,7 +2,7 @@ Index: linux-stage/fs/ext3/extents.c =================================================================== 
--- linux-stage.orig/fs/ext3/extents.c 2005-02-25 15:33:48.890198160 +0200 +++ linux-stage/fs/ext3/extents.c 2005-02-25 15:33:48.917194056 +0200 -@@ -0,0 +1,2353 @@ +@@ -0,0 +1,2359 @@ +/* + * Copyright(c) 2003, 2004, 2005, Cluster File Systems, Inc, info@clusterfs.com + * Written by Alex Tomas @@ -178,7 +178,7 @@ Index: linux-stage/fs/ext3/extents.c +{ + struct ext3_extent_header *neh = EXT_ROOT_HDR(tree); + neh->eh_generation = ((EXT_FLAGS(neh) & ~EXT_FLAGS_CLR_UNKNOWN) << 24) | -+ (EXT_GENERATION(neh) + 1); ++ (EXT_HDR_GEN(neh) + 1); +} + +static inline int ext3_ext_space_block(struct ext3_extents_tree *tree) @@ -560,6 +560,7 @@ Index: linux-stage/fs/ext3/extents.c + + ix->ei_block = logical; + ix->ei_leaf = ptr; ++ ix->ei_leaf_hi = ix->ei_unused = 0; + curp->p_hdr->eh_entries++; + + EXT_ASSERT(curp->p_hdr->eh_entries <= curp->p_hdr->eh_max); @@ -722,6 +723,7 @@ Index: linux-stage/fs/ext3/extents.c + fidx = EXT_FIRST_INDEX(neh); + fidx->ei_block = border; + fidx->ei_leaf = oldblock; ++ fidx->ei_leaf_hi = fidx->ei_unused = 0; + + ext_debug(tree, "int.index at %d (block %lu): %lu -> %lu\n", + i, newblock, border, oldblock); @@ -855,6 +857,7 @@ Index: linux-stage/fs/ext3/extents.c + /* FIXME: it works, but actually path[0] can be index */ + curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; + curp->p_idx->ei_leaf = newblock; ++ curp->p_idx->ei_leaf_hi = curp->p_idx->ei_unused = 0; + + neh = EXT_ROOT_HDR(tree); + fidx = EXT_FIRST_INDEX(neh); @@ -1403,6 +1406,7 @@ Index: linux-stage/fs/ext3/extents.c + if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) { + ex->ee_block = cex->ec_block; + ex->ee_start = cex->ec_start; ++ ex->ee_start_hi = 0; + ex->ee_len = cex->ec_len; + ext_debug(tree, "%lu cached by %lu:%lu:%lu\n", + (unsigned long) block, @@ -1624,7 +1628,7 @@ Index: linux-stage/fs/ext3/extents.c + + if (num == 0) { + /* this extent is removed entirely mark slot unused */ -+ ex->ee_start = 0; ++ ex->ee_start = ex->ee_start_hi = 0; + 
eh->eh_entries--; + fu = ex; + } @@ -1646,7 +1650,7 @@ Index: linux-stage/fs/ext3/extents.c + while (lu < le) { + if (lu->ee_start) { + *fu = *lu; -+ lu->ee_start = 0; ++ lu->ee_start = lu->ee_start_hi = 0; + fu++; + } + lu++; @@ -2001,6 +2005,7 @@ Index: linux-stage/fs/ext3/extents.c + /* allocate new block for the extent */ + goal = ext3_ext_find_goal(inode, path, ex->ee_block); + ex->ee_start = ext3_new_block(handle, inode, goal, err); ++ ex->ee_start_hi = 0; + if (ex->ee_start == 0) { + /* error occured: restore old extent */ + ex->ee_start = newblock; @@ -2116,6 +2121,7 @@ Index: linux-stage/fs/ext3/extents.c + /* try to insert new extent into found leaf and return */ + newex.ee_block = iblock; + newex.ee_start = newblock; ++ newex.ee_start_hi = 0; + newex.ee_len = 1; + err = ext3_ext_insert_extent(handle, &tree, path, &newex); + if (err) @@ -2507,26 +2513,30 @@ Index: linux-stage/fs/ext3/super.c Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -+ Opt_extents, Opt_extdebug, ++ Opt_extents, Opt_noextents, Opt_extdebug, }; static match_table_t tokens = { -@@ -639,6 +644,8 @@ +@@ -639,6 +644,9 @@ {Opt_iopen, "iopen"}, {Opt_noiopen, "noiopen"}, {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_extents, "extents"}, ++ {Opt_noextents, "noextents"}, + {Opt_extdebug, "extdebug"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL}, {Opt_resize, "resize"}, -@@ -943,6 +950,12 @@ +@@ -943,6 +950,15 @@ match_int(&args[0], &option); *n_blocks_count = option; break; + case Opt_extents: + set_opt (sbi->s_mount_opt, EXTENTS); + break; ++ case Opt_noextents: ++ clear_opt (sbi->s_mount_opt, EXTENTS); ++ break; + case Opt_extdebug: + set_opt (sbi->s_mount_opt, EXTDEBUG); + break; @@ -2606,11 +2616,13 @@ Index: linux-stage/include/linux/ext3_fs.h #define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \ EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \ EXT3_FEATURE_RO_COMPAT_BTREE_DIR) -@@ -756,6 +763,7 @@ +@@ 
-756,6 +763,9 @@ /* inode.c */ -+extern int ext3_block_truncate_page(handle_t *, struct page *, struct address_space *, loff_t); ++extern int ext3_block_truncate_page(handle_t *, struct page *, ++ struct address_space *, loff_t); ++extern int ext3_writepage_trans_blocks(struct inode *inode); extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int); extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); @@ -2834,14 +2846,14 @@ Index: linux-stage/include/linux/ext3_extents.h + (EXT_FIRST_EXTENT((__hdr__)) + (__hdr__)->eh_max - 1) +#define EXT_MAX_INDEX(__hdr__) \ + (EXT_FIRST_INDEX((__hdr__)) + (__hdr__)->eh_max - 1) -+#define EXT_GENERATION(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) ++#define EXT_HDR_GEN(__hdr__) ((__hdr__)->eh_generation & 0x00ffffff) +#define EXT_FLAGS(__hdr__) ((__hdr__)->eh_generation >> 24) +#define EXT_FLAGS_CLR_UNKNOWN 0x7 /* Flags cleared on modification */ + +#define EXT_BLOCK_HDR(__bh__) ((struct ext3_extent_header *)(__bh__)->b_data) +#define EXT_ROOT_HDR(__tree__) ((struct ext3_extent_header *)(__tree__)->root) +#define EXT_DEPTH(__tree__) (EXT_ROOT_HDR(__tree__)->eh_depth) -+ ++#define EXT_GENERATION(__tree__) EXT_HDR_GEN(EXT_ROOT_HDR(__tree__)) + +#define EXT_ASSERT(__x__) if (!(__x__)) BUG(); + diff --git a/ldiskfs/kernel_patches/patches/ext3-filterdata-2.6.15.patch b/ldiskfs/kernel_patches/patches/ext3-filterdata-2.6.15.patch new file mode 100644 index 0000000..e6d431f --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-filterdata-2.6.15.patch @@ -0,0 +1,25 @@ +Index: linux-2.6.15/include/linux/ext3_fs_i.h +=================================================================== +--- linux-2.6.15.orig/include/linux/ext3_fs_i.h 2006-02-24 15:41:30.000000000 +0300 ++++ linux-2.6.15/include/linux/ext3_fs_i.h 2006-02-24 15:41:31.000000000 +0300 +@@ -135,6 +135,8 @@ struct ext3_inode_info { 
+ struct inode vfs_inode; + + __u32 i_cached_extent[4]; ++ ++ void *i_filterdata; + }; + + #endif /* _LINUX_EXT3_FS_I */ +Index: linux-2.6.15/fs/ext3/super.c +=================================================================== +--- linux-2.6.15.orig/fs/ext3/super.c 2006-02-24 15:41:30.000000000 +0300 ++++ linux-2.6.15/fs/ext3/super.c 2006-02-24 15:42:02.000000000 +0300 +@@ -459,6 +459,7 @@ static struct inode *ext3_alloc_inode(st + ei->vfs_inode.i_version = 1; + + memset(&ei->i_cached_extent, 0, sizeof(ei->i_cached_extent)); ++ ei->i_filterdata = NULL; + return &ei->vfs_inode; + } + diff --git a/ldiskfs/kernel_patches/patches/ext3-lookup-dotdot-2.6.9.patch b/ldiskfs/kernel_patches/patches/ext3-lookup-dotdot-2.6.9.patch new file mode 100644 index 0000000..a05256b --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-lookup-dotdot-2.6.9.patch @@ -0,0 +1,63 @@ +Index: linux-2.6.9-full/fs/ext3/iopen.c +=================================================================== +--- linux-2.6.9-full.orig/fs/ext3/iopen.c 2006-04-25 08:51:11.000000000 +0400 ++++ linux-2.6.9-full/fs/ext3/iopen.c 2006-05-06 01:21:11.000000000 +0400 +@@ -94,9 +94,12 @@ static struct dentry *iopen_lookup(struc + assert(!(alternate->d_flags & DCACHE_DISCONNECTED)); + } + +- if (!list_empty(&inode->i_dentry)) { +- alternate = list_entry(inode->i_dentry.next, +- struct dentry, d_alias); ++ list_for_each(lp, &inode->i_dentry) { ++ alternate = list_entry(lp, struct dentry, d_alias); ++ /* ignore dentries created for ".." 
to preserve ++ * proper dcache hierarchy -- bug 10458 */ ++ if (alternate->d_flags & DCACHE_NFSFS_RENAMED) ++ continue; + dget_locked(alternate); + spin_lock(&alternate->d_lock); + alternate->d_flags |= DCACHE_REFERENCED; +Index: linux-2.6.9-full/fs/ext3/namei.c +=================================================================== +--- linux-2.6.9-full.orig/fs/ext3/namei.c 2006-05-06 01:21:10.000000000 +0400 ++++ linux-2.6.9-full/fs/ext3/namei.c 2006-05-06 01:29:30.000000000 +0400 +@@ -1003,6 +1003,38 @@ static struct dentry *ext3_lookup(struct + return ERR_PTR(-EACCES); + } + ++ /* ".." shouldn't go into dcache to preserve dcache hierarchy ++ * otherwise we'll get parent being a child of actual child. ++ * see bug 10458 for details -bzzz */ ++ if (inode && (dentry->d_name.name[0] == '.' && (dentry->d_name.len == 1 || ++ (dentry->d_name.len == 2 && dentry->d_name.name[1] == '.')))) { ++ struct dentry *tmp, *goal = NULL; ++ struct list_head *lp; ++ ++ /* first, look for an existing dentry - any one is good */ ++ spin_lock(&dcache_lock); ++ list_for_each(lp, &inode->i_dentry) { ++ tmp = list_entry(lp, struct dentry, d_alias); ++ goal = tmp; ++ dget_locked(goal); ++ break; ++ } ++ if (goal == NULL) { ++ /* there is no alias, we need to make current dentry: ++ * a) inaccessible for __d_lookup() ++ * b) inaccessible for iopen */ ++ J_ASSERT(list_empty(&dentry->d_alias)); ++ dentry->d_flags |= DCACHE_NFSFS_RENAMED; ++ /* this is d_instantiate() ... 
*/ ++ list_add(&dentry->d_alias, &inode->i_dentry); ++ dentry->d_inode = inode; ++ } ++ spin_unlock(&dcache_lock); ++ if (goal) ++ iput(inode); ++ return goal; ++ } ++ + return iopen_connect_dentry(dentry, inode, 1); + } + diff --git a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-fc5.patch b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-fc5.patch new file mode 100644 index 0000000..325d080 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-fc5.patch @@ -0,0 +1,2779 @@ +Index: linux-2.6.16.i686/fs/ext3/inode.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/inode.c 2006-05-30 22:55:32.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/inode.c 2006-05-30 23:02:59.000000000 +0800 +@@ -568,7 +568,7 @@ + ext3_journal_forget(handle, branch[i].bh); + } + for (i = 0; i < keys; i++) +- ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1); ++ ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1, 1); + return err; + } + +@@ -1862,7 +1862,7 @@ + } + } + +- ext3_free_blocks(handle, inode, block_to_free, count); ++ ext3_free_blocks(handle, inode, block_to_free, count, 1); + } + + /** +@@ -2035,7 +2035,7 @@ + ext3_journal_test_restart(handle, inode); + } + +- ext3_free_blocks(handle, inode, nr, 1); ++ ext3_free_blocks(handle, inode, nr, 1, 1); + + if (parent_bh) { + /* +Index: linux-2.6.16.i686/fs/ext3/mballoc.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/mballoc.c 2006-05-31 04:14:15.752410384 +0800 ++++ linux-2.6.16.i686/fs/ext3/mballoc.c 2006-05-30 23:03:38.000000000 +0800 +@@ -0,0 +1,2434 @@ ++/* ++ * Copyright (c) 2003-2005, Cluster File Systems, Inc, info@clusterfs.com ++ * Written by Alex Tomas ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License version 2 as ++ * published by the Free Software Foundation. 
++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. ++ * ++ * You should have received a copy of the GNU General Public License ++ * along with this program; if not, write to the Free Software ++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- ++ */ ++ ++ ++/* ++ * mballoc.c contains the multiblocks allocation routines ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++/* ++ * TODO: ++ * - bitmap read-ahead (proposed by Oleg Drokin aka green) ++ * - track min/max extents in each group for better group selection ++ * - mb_mark_used() may allocate chunk right after splitting buddy ++ * - special flag to advise allocator to look for requested + N blocks ++ * this may improve interaction between extents and mballoc ++ * - tree of groups sorted by number of free blocks ++ * - percpu reservation code (hotpath) ++ * - error handling ++ */ ++ ++/* ++ * with AGGRESSIVE_CHECK allocator runs consistency checks over ++ * structures. these checks slow things down a lot ++ */ ++#define AGGRESSIVE_CHECK__ ++ ++/* ++ */ ++#define MB_DEBUG__ ++#ifdef MB_DEBUG ++#define mb_debug(fmt,a...) printk(fmt, ##a) ++#else ++#define mb_debug(fmt,a...) ++#endif ++ ++/* ++ * with EXT3_MB_HISTORY mballoc stores last N allocations in memory ++ * and you can monitor it in /proc/fs/ext3/<dev>/mb_history ++ */ ++#define EXT3_MB_HISTORY ++ ++/* ++ * How long mballoc can look for a best extent (in found extents) ++ */ ++long ext3_mb_max_to_scan = 500; ++ ++/* ++ * How long mballoc must look for a best extent ++ */ ++long ext3_mb_min_to_scan = 30; ++ ++/* ++ * with 'ext3_mb_stats' allocator will collect stats that will be ++ * shown at umount. The collecting costs though!
++ */ ++ ++long ext3_mb_stats = 1; ++ ++#ifdef EXT3_BB_MAX_BLOCKS ++#undef EXT3_BB_MAX_BLOCKS ++#endif ++#define EXT3_BB_MAX_BLOCKS 30 ++ ++struct ext3_free_metadata { ++ unsigned short group; ++ unsigned short num; ++ unsigned short blocks[EXT3_BB_MAX_BLOCKS]; ++ struct list_head list; ++}; ++ ++struct ext3_group_info { ++ unsigned long bb_state; ++ unsigned long bb_tid; ++ struct ext3_free_metadata *bb_md_cur; ++ unsigned short bb_first_free; ++ unsigned short bb_free; ++ unsigned short bb_fragments; ++ unsigned short bb_counters[]; ++}; ++ ++ ++#define EXT3_GROUP_INFO_NEED_INIT_BIT 0 ++#define EXT3_GROUP_INFO_LOCKED_BIT 1 ++ ++#define EXT3_MB_GRP_NEED_INIT(grp) \ ++ (test_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, &(grp)->bb_state)) ++ ++struct ext3_free_extent { ++ __u16 fe_start; ++ __u16 fe_len; ++ __u16 fe_group; ++}; ++ ++struct ext3_allocation_context { ++ struct super_block *ac_sb; ++ ++ /* search goals */ ++ struct ext3_free_extent ac_g_ex; ++ ++ /* the best found extent */ ++ struct ext3_free_extent ac_b_ex; ++ ++ /* number of iterations done. 
we have to track to limit searching */ ++ unsigned long ac_ex_scanned; ++ __u16 ac_groups_scanned; ++ __u16 ac_found; ++ __u16 ac_tail; ++ __u16 ac_buddy; ++ __u8 ac_status; ++ __u8 ac_flags; /* allocation hints */ ++ __u8 ac_criteria; ++ __u8 ac_repeats; ++ __u8 ac_2order; /* if request is to allocate 2^N blocks and ++ * N > 0, the field stores N, otherwise 0 */ ++}; ++ ++#define AC_STATUS_CONTINUE 1 ++#define AC_STATUS_FOUND 2 ++#define AC_STATUS_BREAK 3 ++ ++struct ext3_mb_history { ++ struct ext3_free_extent goal; /* goal allocation */ ++ struct ext3_free_extent result; /* result allocation */ ++ __u16 found; /* how many extents have been found */ ++ __u16 groups; /* how many groups have been scanned */ ++ __u16 tail; /* what tail broke some buddy */ ++ __u16 buddy; /* buddy the tail ^^^ broke */ ++ __u8 cr; /* which phase the result extent was found at */ ++ __u8 merged; ++}; ++ ++struct ext3_buddy { ++ struct page *bd_buddy_page; ++ void *bd_buddy; ++ struct page *bd_bitmap_page; ++ void *bd_bitmap; ++ struct ext3_group_info *bd_info; ++ struct super_block *bd_sb; ++ __u16 bd_blkbits; ++ __u16 bd_group; ++}; ++#define EXT3_MB_BITMAP(e3b) ((e3b)->bd_bitmap) ++#define EXT3_MB_BUDDY(e3b) ((e3b)->bd_buddy) ++ ++#ifndef EXT3_MB_HISTORY ++#define ext3_mb_store_history(sb,ac) ++#else ++static void ext3_mb_store_history(struct super_block *, ++ struct ext3_allocation_context *ac); ++#endif ++ ++#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1) ++ ++static struct proc_dir_entry *proc_root_ext3; ++ ++int ext3_create (struct inode *, struct dentry *, int, struct nameidata *); ++struct buffer_head * read_block_bitmap(struct super_block *, unsigned int); ++int ext3_new_block_old(handle_t *, struct inode *, unsigned long, int *); ++int ext3_mb_reserve_blocks(struct super_block *, int); ++void ext3_mb_release_blocks(struct super_block *, int); ++void ext3_mb_poll_new_transaction(struct super_block *, handle_t *); ++void 
ext3_mb_free_committed_blocks(struct super_block *); ++ ++#if BITS_PER_LONG == 64 ++#define mb_correct_addr_and_bit(bit,addr) \ ++{ \ ++ bit += ((unsigned long) addr & 7UL) << 3; \ ++ addr = (void *) ((unsigned long) addr & ~7UL); \ ++} ++#elif BITS_PER_LONG == 32 ++#define mb_correct_addr_and_bit(bit,addr) \ ++{ \ ++ bit += ((unsigned long) addr & 3UL) << 3; \ ++ addr = (void *) ((unsigned long) addr & ~3UL); \ ++} ++#else ++#error "how many bits you are?!" ++#endif ++ ++static inline int mb_test_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ return ext2_test_bit(bit, addr); ++} ++ ++static inline void mb_set_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_set_bit(bit, addr); ++} ++ ++static inline void mb_set_bit_atomic(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_set_bit_atomic(NULL, bit, addr); ++} ++ ++static inline void mb_clear_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_clear_bit(bit, addr); ++} ++ ++static inline void mb_clear_bit_atomic(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_clear_bit_atomic(NULL, bit, addr); ++} ++ ++static inline int mb_find_next_zero_bit(void *addr, int max, int start) ++{ ++ int fix; ++#if BITS_PER_LONG == 64 ++ fix = ((unsigned long) addr & 7UL) << 3; ++ addr = (void *) ((unsigned long) addr & ~7UL); ++#elif BITS_PER_LONG == 32 ++ fix = ((unsigned long) addr & 3UL) << 3; ++ addr = (void *) ((unsigned long) addr & ~3UL); ++#else ++#error "how many bits you are?!" 
++#endif ++ max += fix; ++ start += fix; ++ return ext2_find_next_zero_bit(addr, max, start) - fix; ++} ++ ++static inline void *mb_find_buddy(struct ext3_buddy *e3b, int order, int *max) ++{ ++ char *bb; ++ ++ J_ASSERT(EXT3_MB_BITMAP(e3b) != EXT3_MB_BUDDY(e3b)); ++ J_ASSERT(max != NULL); ++ ++ if (order > e3b->bd_blkbits + 1) { ++ *max = 0; ++ return NULL; ++ } ++ ++ /* at order 0 we see each particular block */ ++ *max = 1 << (e3b->bd_blkbits + 3); ++ if (order == 0) ++ return EXT3_MB_BITMAP(e3b); ++ ++ bb = EXT3_MB_BUDDY(e3b) + EXT3_SB(e3b->bd_sb)->s_mb_offsets[order]; ++ *max = EXT3_SB(e3b->bd_sb)->s_mb_maxs[order]; ++ ++ return bb; ++} ++ ++#ifdef AGGRESSIVE_CHECK ++ ++static void mb_check_buddy(struct ext3_buddy *e3b) ++{ ++ int order = e3b->bd_blkbits + 1; ++ int max, max2, i, j, k, count; ++ int fragments = 0, fstart; ++ void *buddy, *buddy2; ++ ++ if (!test_opt(e3b->bd_sb, MBALLOC)) ++ return; ++ ++ { ++ static int mb_check_counter = 0; ++ if (mb_check_counter++ % 300 != 0) ++ return; ++ } ++ ++ while (order > 1) { ++ buddy = mb_find_buddy(e3b, order, &max); ++ J_ASSERT(buddy); ++ buddy2 = mb_find_buddy(e3b, order - 1, &max2); ++ J_ASSERT(buddy2); ++ J_ASSERT(buddy != buddy2); ++ J_ASSERT(max * 2 == max2); ++ ++ count = 0; ++ for (i = 0; i < max; i++) { ++ ++ if (mb_test_bit(i, buddy)) { ++ /* only single bit in buddy2 may be 1 */ ++ if (!mb_test_bit(i << 1, buddy2)) ++ J_ASSERT(mb_test_bit((i<<1)+1, buddy2)); ++ else if (!mb_test_bit((i << 1) + 1, buddy2)) ++ J_ASSERT(mb_test_bit(i << 1, buddy2)); ++ continue; ++ } ++ ++ /* both bits in buddy2 must be 0 */ ++ J_ASSERT(mb_test_bit(i << 1, buddy2)); ++ J_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); ++ ++ for (j = 0; j < (1 << order); j++) { ++ k = (i * (1 << order)) + j; ++ J_ASSERT(!mb_test_bit(k, EXT3_MB_BITMAP(e3b))); ++ } ++ count++; ++ } ++ J_ASSERT(e3b->bd_info->bb_counters[order] == count); ++ order--; ++ } ++ ++ fstart = -1; ++ buddy = mb_find_buddy(e3b, 0, &max); ++ for (i = 0; i < max; i++) { ++ if 
(!mb_test_bit(i, buddy)) { ++ J_ASSERT(i >= e3b->bd_info->bb_first_free); ++ if (fstart == -1) { ++ fragments++; ++ fstart = i; ++ } ++ continue; ++ } ++ fstart = -1; ++ /* check used bits only */ ++ for (j = 0; j < e3b->bd_blkbits + 1; j++) { ++ buddy2 = mb_find_buddy(e3b, j, &max2); ++ k = i >> j; ++ J_ASSERT(k < max2); ++ J_ASSERT(mb_test_bit(k, buddy2)); ++ } ++ } ++ J_ASSERT(!EXT3_MB_GRP_NEED_INIT(e3b->bd_info)); ++ J_ASSERT(e3b->bd_info->bb_fragments == fragments); ++} ++ ++#else ++#define mb_check_buddy(e3b) ++#endif ++ ++/* find most significant bit */ ++static int inline fmsb(unsigned short word) ++{ ++ int order; ++ ++ if (word > 255) { ++ order = 7; ++ word >>= 8; ++ } else { ++ order = -1; ++ } ++ ++ do { ++ order++; ++ word >>= 1; ++ } while (word != 0); ++ ++ return order; ++} ++ ++static void inline ++ext3_mb_mark_free_simple(struct super_block *sb, void *buddy, unsigned first, ++ int len, struct ext3_group_info *grp) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ unsigned short min, max, chunk, border; ++ ++ mb_debug("mark %u/%u free\n", first, len); ++ J_ASSERT(len < EXT3_BLOCKS_PER_GROUP(sb)); ++ ++ border = 2 << sb->s_blocksize_bits; ++ ++ while (len > 0) { ++ /* find how many blocks can be covered since this position */ ++ max = ffs(first | border) - 1; ++ ++ /* find how many blocks of power 2 we need to mark */ ++ min = fmsb(len); ++ ++ mb_debug(" %u/%u -> max %u, min %u\n", ++ first & ((2 << sb->s_blocksize_bits) - 1), ++ len, max, min); ++ ++ if (max < min) ++ min = max; ++ chunk = 1 << min; ++ ++ /* mark multiblock chunks only */ ++ grp->bb_counters[min]++; ++ if (min > 0) { ++ mb_debug(" set %u at %u \n", first >> min, ++ sbi->s_mb_offsets[min]); ++ mb_clear_bit(first >> min, buddy + sbi->s_mb_offsets[min]); ++ } ++ ++ len -= chunk; ++ first += chunk; ++ } ++} ++ ++static void ++ext3_mb_generate_buddy(struct super_block *sb, void *buddy, void *bitmap, ++ struct ext3_group_info *grp) ++{ ++ unsigned short max = EXT3_BLOCKS_PER_GROUP(sb); 
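The `fmsb()`/`ext3_mb_mark_free_simple()` pair above decomposes a free range into aligned power-of-two chunks: at each step the chunk order is the smaller of the alignment of `first` and the largest power of two fitting in `len`. A stand-alone sketch of that walk, under stated assumptions (a plain `border` parameter stands in for `2 << s_blocksize_bits`, and the buddy-bitmap updates are omitted):

```c
#include <assert.h>
#include <strings.h>   /* ffs() */

/* find most significant set bit; a simplified version of the patch's fmsb() */
static int fmsb(unsigned short word)
{
    int order = -1;

    do {
        order++;
        word >>= 1;
    } while (word != 0);

    return order;
}

/* Greedy split of [first, first+len) into aligned power-of-two chunks,
 * the same walk ext3_mb_mark_free_simple() performs. 'border' caps the
 * order as 2 << s_blocksize_bits does in the patch. Returns chunk count. */
static int split_free_range(unsigned first, int len, unsigned border,
                            unsigned starts[], int orders[])
{
    int n = 0;

    while (len > 0) {
        int max = ffs(first | border) - 1;   /* largest order 'first' is aligned to */
        int min = fmsb((unsigned short)len); /* largest order fitting in len */
        int ord = min < max ? min : max;

        starts[n] = first;
        orders[n] = ord;
        n++;

        first += 1u << ord;
        len -= 1 << ord;
    }
    return n;
}
```

For example, the range starting at block 5 with length 11 splits into chunks of order 0, 1 and 3: one block to reach 2-alignment, two blocks to reach 8-alignment, then one 8-block chunk.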
++ unsigned short i = 0, first, len; ++ unsigned free = 0, fragments = 0; ++ unsigned long long period = get_cycles(); ++ ++ i = mb_find_next_zero_bit(bitmap, max, 0); ++ grp->bb_first_free = i; ++ while (i < max) { ++ fragments++; ++ first = i; ++ i = find_next_bit(bitmap, max, i); ++ len = i - first; ++ free += len; ++ if (len > 1) ++ ext3_mb_mark_free_simple(sb, buddy, first, len, grp); ++ else ++ grp->bb_counters[0]++; ++ if (i < max) ++ i = mb_find_next_zero_bit(bitmap, max, i); ++ } ++ grp->bb_fragments = fragments; ++ ++ /* bb_state shouldn't being modified because all ++ * others waits for init completion on page lock */ ++ clear_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, &grp->bb_state); ++ if (free != grp->bb_free) { ++ printk("EXT3-fs: %u blocks in bitmap, %u in group descriptor\n", ++ free, grp->bb_free); ++ grp->bb_free = free; ++ } ++ ++ period = get_cycles() - period; ++ spin_lock(&EXT3_SB(sb)->s_bal_lock); ++ EXT3_SB(sb)->s_mb_buddies_generated++; ++ EXT3_SB(sb)->s_mb_generation_time += period; ++ spin_unlock(&EXT3_SB(sb)->s_bal_lock); ++} ++ ++static int ext3_mb_init_cache(struct page *page) ++{ ++ int blocksize, blocks_per_page, groups_per_page; ++ int err = 0, i, first_group, first_block; ++ struct super_block *sb; ++ struct buffer_head *bhs; ++ struct buffer_head **bh; ++ struct inode *inode; ++ char *data, *bitmap; ++ ++ mb_debug("init page %lu\n", page->index); ++ ++ inode = page->mapping->host; ++ sb = inode->i_sb; ++ blocksize = 1 << inode->i_blkbits; ++ blocks_per_page = PAGE_CACHE_SIZE / blocksize; ++ ++ groups_per_page = blocks_per_page >> 1; ++ if (groups_per_page == 0) ++ groups_per_page = 1; ++ ++ /* allocate buffer_heads to read bitmaps */ ++ if (groups_per_page > 1) { ++ err = -ENOMEM; ++ i = sizeof(struct buffer_head *) * groups_per_page; ++ bh = kmalloc(i, GFP_NOFS); ++ if (bh == NULL) ++ goto out; ++ memset(bh, 0, i); ++ } else ++ bh = &bhs; ++ ++ first_group = page->index * blocks_per_page / 2; ++ ++ /* read all groups the page covers 
into the cache */ ++ for (i = 0; i < groups_per_page; i++) { ++ struct ext3_group_desc * desc; ++ ++ if (first_group + i >= EXT3_SB(sb)->s_groups_count) ++ break; ++ ++ err = -EIO; ++ desc = ext3_get_group_desc(sb, first_group + i, NULL); ++ if (desc == NULL) ++ goto out; ++ ++ err = -ENOMEM; ++ bh[i] = sb_getblk(sb, le32_to_cpu(desc->bg_block_bitmap)); ++ if (bh[i] == NULL) ++ goto out; ++ ++ if (buffer_uptodate(bh[i])) ++ continue; ++ ++ lock_buffer(bh[i]); ++ if (buffer_uptodate(bh[i])) { ++ unlock_buffer(bh[i]); ++ continue; ++ } ++ ++ get_bh(bh[i]); ++ bh[i]->b_end_io = end_buffer_read_sync; ++ submit_bh(READ, bh[i]); ++ mb_debug("read bitmap for group %u\n", first_group + i); ++ } ++ ++ /* wait for I/O completion */ ++ for (i = 0; i < groups_per_page && bh[i]; i++) ++ wait_on_buffer(bh[i]); ++ ++ /* XXX: I/O error handling here */ ++ ++ first_block = page->index * blocks_per_page; ++ for (i = 0; i < blocks_per_page; i++) { ++ int group; ++ ++ group = (first_block + i) >> 1; ++ if (group >= EXT3_SB(sb)->s_groups_count) ++ break; ++ ++ data = page_address(page) + (i * blocksize); ++ bitmap = bh[group - first_group]->b_data; ++ ++ if ((first_block + i) & 1) { ++ /* this is block of buddy */ ++ mb_debug("put buddy for group %u in page %lu/%x\n", ++ group, page->index, i * blocksize); ++ memset(data, 0xff, blocksize); ++ EXT3_SB(sb)->s_group_info[group]->bb_fragments = 0; ++ memset(EXT3_SB(sb)->s_group_info[group]->bb_counters, 0, ++ sizeof(unsigned short)*(sb->s_blocksize_bits+2)); ++ ext3_mb_generate_buddy(sb, data, bitmap, ++ EXT3_SB(sb)->s_group_info[group]); ++ } else { ++ /* this is block of bitmap */ ++ mb_debug("put bitmap for group %u in page %lu/%x\n", ++ group, page->index, i * blocksize); ++ memcpy(data, bitmap, blocksize); ++ } ++ } ++ SetPageUptodate(page); ++ ++out: ++ for (i = 0; i < groups_per_page && bh[i]; i++) ++ brelse(bh[i]); ++ if (bh && bh != &bhs) ++ kfree(bh); ++ return err; ++} ++ ++static int ext3_mb_load_buddy(struct super_block *sb, 
int group, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct inode *inode = sbi->s_buddy_cache; ++ int blocks_per_page, block, pnum, poff; ++ struct page *page; ++ ++ mb_debug("load group %u\n", group); ++ ++ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; ++ ++ e3b->bd_blkbits = sb->s_blocksize_bits; ++ e3b->bd_info = sbi->s_group_info[group]; ++ e3b->bd_sb = sb; ++ e3b->bd_group = group; ++ e3b->bd_buddy_page = NULL; ++ e3b->bd_bitmap_page = NULL; ++ ++ block = group * 2; ++ pnum = block / blocks_per_page; ++ poff = block % blocks_per_page; ++ ++ page = find_get_page(inode->i_mapping, pnum); ++ if (page == NULL || !PageUptodate(page)) { ++ if (page) ++ page_cache_release(page); ++ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); ++ if (page) { ++ if (!PageUptodate(page)) ++ ext3_mb_init_cache(page); ++ unlock_page(page); ++ } ++ } ++ if (page == NULL || !PageUptodate(page)) ++ goto err; ++ e3b->bd_bitmap_page = page; ++ e3b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); ++ mark_page_accessed(page); ++ ++ block++; ++ pnum = block / blocks_per_page; ++ poff = block % blocks_per_page; ++ ++ page = find_get_page(inode->i_mapping, pnum); ++ if (page == NULL || !PageUptodate(page)) { ++ if (page) ++ page_cache_release(page); ++ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); ++ if (page) { ++ if (!PageUptodate(page)) ++ ext3_mb_init_cache(page); ++ unlock_page(page); ++ } ++ } ++ if (page == NULL || !PageUptodate(page)) ++ goto err; ++ e3b->bd_buddy_page = page; ++ e3b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); ++ mark_page_accessed(page); ++ ++ J_ASSERT(e3b->bd_bitmap_page != NULL); ++ J_ASSERT(e3b->bd_buddy_page != NULL); ++ ++ return 0; ++ ++err: ++ if (e3b->bd_bitmap_page) ++ page_cache_release(e3b->bd_bitmap_page); ++ if (e3b->bd_buddy_page) ++ page_cache_release(e3b->bd_buddy_page); ++ e3b->bd_buddy = NULL; ++ e3b->bd_bitmap = NULL; ++ return -EIO; ++} ++ ++static void 
ext3_mb_release_desc(struct ext3_buddy *e3b) ++{ ++ if (e3b->bd_bitmap_page) ++ page_cache_release(e3b->bd_bitmap_page); ++ if (e3b->bd_buddy_page) ++ page_cache_release(e3b->bd_buddy_page); ++} ++ ++ ++static inline void ++ext3_lock_group(struct super_block *sb, int group) ++{ ++ bit_spin_lock(EXT3_GROUP_INFO_LOCKED_BIT, ++ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++} ++ ++static inline void ++ext3_unlock_group(struct super_block *sb, int group) ++{ ++ bit_spin_unlock(EXT3_GROUP_INFO_LOCKED_BIT, ++ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++} ++ ++static int mb_find_order_for_block(struct ext3_buddy *e3b, int block) ++{ ++ int order = 1; ++ void *bb; ++ ++ J_ASSERT(EXT3_MB_BITMAP(e3b) != EXT3_MB_BUDDY(e3b)); ++ J_ASSERT(block < (1 << (e3b->bd_blkbits + 3))); ++ ++ bb = EXT3_MB_BUDDY(e3b); ++ while (order <= e3b->bd_blkbits + 1) { ++ block = block >> 1; ++ if (!mb_test_bit(block, bb)) { ++ /* this block is part of buddy of order 'order' */ ++ return order; ++ } ++ bb += 1 << (e3b->bd_blkbits - order); ++ order++; ++ } ++ return 0; ++} ++ ++static inline void mb_clear_bits(void *bm, int cur, int len) ++{ ++ __u32 *addr; ++ ++ len = cur + len; ++ while (cur < len) { ++ if ((cur & 31) == 0 && (len - cur) >= 32) { ++ /* fast path: clear whole word at once */ ++ addr = bm + (cur >> 3); ++ *addr = 0; ++ cur += 32; ++ continue; ++ } ++ mb_clear_bit_atomic(cur, bm); ++ cur++; ++ } ++} ++ ++static inline void mb_set_bits(void *bm, int cur, int len) ++{ ++ __u32 *addr; ++ ++ len = cur + len; ++ while (cur < len) { ++ if ((cur & 31) == 0 && (len - cur) >= 32) { ++ /* fast path: clear whole word at once */ ++ addr = bm + (cur >> 3); ++ *addr = 0xffffffff; ++ cur += 32; ++ continue; ++ } ++ mb_set_bit_atomic(cur, bm); ++ cur++; ++ } ++} ++ ++static int mb_free_blocks(struct ext3_buddy *e3b, int first, int count) ++{ ++ int block = 0, max = 0, order; ++ void *buddy, *buddy2; ++ ++ mb_check_buddy(e3b); ++ ++ e3b->bd_info->bb_free += count; ++ if (first < 
e3b->bd_info->bb_first_free) ++ e3b->bd_info->bb_first_free = first; ++ ++ /* let's maintain fragments counter */ ++ if (first != 0) ++ block = !mb_test_bit(first - 1, EXT3_MB_BITMAP(e3b)); ++ if (first + count < EXT3_SB(e3b->bd_sb)->s_mb_maxs[0]) ++ max = !mb_test_bit(first + count, EXT3_MB_BITMAP(e3b)); ++ if (block && max) ++ e3b->bd_info->bb_fragments--; ++ else if (!block && !max) ++ e3b->bd_info->bb_fragments++; ++ ++ /* let's maintain buddy itself */ ++ while (count-- > 0) { ++ block = first++; ++ order = 0; ++ ++ J_ASSERT(mb_test_bit(block, EXT3_MB_BITMAP(e3b))); ++ mb_clear_bit(block, EXT3_MB_BITMAP(e3b)); ++ e3b->bd_info->bb_counters[order]++; ++ ++ /* start of the buddy */ ++ buddy = mb_find_buddy(e3b, order, &max); ++ ++ do { ++ block &= ~1UL; ++ if (mb_test_bit(block, buddy) || ++ mb_test_bit(block + 1, buddy)) ++ break; ++ ++ /* both the buddies are free, try to coalesce them */ ++ buddy2 = mb_find_buddy(e3b, order + 1, &max); ++ ++ if (!buddy2) ++ break; ++ ++ if (order > 0) { ++ /* for special purposes, we don't set ++ * free bits in bitmap */ ++ mb_set_bit(block, buddy); ++ mb_set_bit(block + 1, buddy); ++ } ++ e3b->bd_info->bb_counters[order]--; ++ e3b->bd_info->bb_counters[order]--; ++ ++ block = block >> 1; ++ order++; ++ e3b->bd_info->bb_counters[order]++; ++ ++ mb_clear_bit(block, buddy2); ++ buddy = buddy2; ++ } while (1); ++ } ++ mb_check_buddy(e3b); ++ ++ return 0; ++} ++ ++static int mb_find_extent(struct ext3_buddy *e3b, int order, int block, ++ int needed, struct ext3_free_extent *ex) ++{ ++ int next, max, ord; ++ void *buddy; ++ ++ J_ASSERT(ex != NULL); ++ ++ buddy = mb_find_buddy(e3b, order, &max); ++ J_ASSERT(buddy); ++ J_ASSERT(block < max); ++ if (mb_test_bit(block, buddy)) { ++ ex->fe_len = 0; ++ ex->fe_start = 0; ++ ex->fe_group = 0; ++ return 0; ++ } ++ ++ if (likely(order == 0)) { ++ /* find actual order */ ++ order = mb_find_order_for_block(e3b, block); ++ block = block >> order; ++ } ++ ++ ex->fe_len = 1 << order; ++ 
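`mb_free_blocks()` above coalesces a freed block up the buddy orders whenever both halves of a pair become free, adjusting the per-order counters as it merges. A minimal user-space model of that merge loop (hypothetical flat `free_map`/`counters` arrays stand in for the on-page buddy bitmaps; the on-disk bitmap handling is omitted):

```c
#include <assert.h>

#define MAX_ORDER 4
#define NBLOCKS (1 << MAX_ORDER)

/* free_map[o][i] == 1 means chunk i of order o (blocks [i<<o, (i+1)<<o))
 * is a free buddy at that order; counters[o] counts such chunks, like
 * bb_counters[] in the patch. */
static unsigned char free_map[MAX_ORDER + 1][NBLOCKS];
static int counters[MAX_ORDER + 1];

static void buddy_free_block(int block)
{
    int order = 0;

    free_map[0][block] = 1;
    counters[0]++;

    while (order < MAX_ORDER) {
        int partner = block ^ 1;          /* the buddy at this order */

        if (!free_map[order][partner])
            break;

        /* both halves free: merge them into one chunk one order up */
        free_map[order][block] = 0;
        free_map[order][partner] = 0;
        counters[order] -= 2;

        block >>= 1;
        order++;
        free_map[order][block] = 1;
        counters[order]++;
    }
}
```

Freeing blocks 0..3 one at a time first produces an order-1 chunk, then a single order-2 chunk once all four blocks are free.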
ex->fe_start = block << order; ++ ex->fe_group = e3b->bd_group; ++ ++ while (needed > ex->fe_len && (buddy = mb_find_buddy(e3b, order, &max))) { ++ ++ if (block + 1 >= max) ++ break; ++ ++ next = (block + 1) * (1 << order); ++ if (mb_test_bit(next, EXT3_MB_BITMAP(e3b))) ++ break; ++ ++ ord = mb_find_order_for_block(e3b, next); ++ ++ order = ord; ++ block = next >> order; ++ ex->fe_len += 1 << order; ++ } ++ ++ J_ASSERT(ex->fe_start + ex->fe_len <= (1 << (e3b->bd_blkbits + 3))); ++ return ex->fe_len; ++} ++ ++static int mb_mark_used(struct ext3_buddy *e3b, struct ext3_free_extent *ex) ++{ ++ int ord, mlen = 0, max = 0, cur; ++ int start = ex->fe_start; ++ int len = ex->fe_len; ++ unsigned ret = 0; ++ int len0 = len; ++ void *buddy; ++ ++ mb_check_buddy(e3b); ++ ++ e3b->bd_info->bb_free -= len; ++ if (e3b->bd_info->bb_first_free == start) ++ e3b->bd_info->bb_first_free += len; ++ ++ /* let's maintain fragments counter */ ++ if (start != 0) ++ mlen = !mb_test_bit(start - 1, EXT3_MB_BITMAP(e3b)); ++ if (start + len < EXT3_SB(e3b->bd_sb)->s_mb_maxs[0]) ++ max = !mb_test_bit(start + len, EXT3_MB_BITMAP(e3b)); ++ if (mlen && max) ++ e3b->bd_info->bb_fragments++; ++ else if (!mlen && !max) ++ e3b->bd_info->bb_fragments--; ++ ++ /* let's maintain buddy itself */ ++ while (len) { ++ ord = mb_find_order_for_block(e3b, start); ++ ++ if (((start >> ord) << ord) == start && len >= (1 << ord)) { ++ /* the whole chunk may be allocated at once! 
*/ ++ mlen = 1 << ord; ++ buddy = mb_find_buddy(e3b, ord, &max); ++ J_ASSERT((start >> ord) < max); ++ mb_set_bit(start >> ord, buddy); ++ e3b->bd_info->bb_counters[ord]--; ++ start += mlen; ++ len -= mlen; ++ J_ASSERT(len >= 0); ++ continue; ++ } ++ ++ /* store for history */ ++ if (ret == 0) ++ ret = len | (ord << 16); ++ ++ /* we have to split a large buddy */ ++ J_ASSERT(ord > 0); ++ buddy = mb_find_buddy(e3b, ord, &max); ++ mb_set_bit(start >> ord, buddy); ++ e3b->bd_info->bb_counters[ord]--; ++ ++ ord--; ++ cur = (start >> ord) & ~1U; ++ buddy = mb_find_buddy(e3b, ord, &max); ++ mb_clear_bit(cur, buddy); ++ mb_clear_bit(cur + 1, buddy); ++ e3b->bd_info->bb_counters[ord]++; ++ e3b->bd_info->bb_counters[ord]++; ++ } ++ ++ /* now set all the allocated bits in the bitmap */ ++ mb_set_bits(EXT3_MB_BITMAP(e3b), ex->fe_start, len0); ++ ++ mb_check_buddy(e3b); ++ ++ return ret; ++} ++ ++/* ++ * Must be called under group lock! ++ */ ++static void ext3_mb_use_best_found(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ unsigned long ret; ++ ++ ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len); ++ ret = mb_mark_used(e3b, &ac->ac_b_ex); ++ ++ ac->ac_status = AC_STATUS_FOUND; ++ ac->ac_tail = ret & 0xffff; ++ ac->ac_buddy = ret >> 16; ++} ++ ++/* ++ * The routine checks whether the found extent is good enough. If it is, ++ * the extent is marked used and a flag is set in the context ++ * to stop scanning. Otherwise, the extent is compared with the ++ * previously found extent and, if the new one is better, it's stored ++ * in the context. The best found extent will be used later, if ++ * mballoc can't find a good enough extent. ++ * ++ * FIXME: real allocation policy is to be designed yet!
++ */ ++static void ext3_mb_measure_extent(struct ext3_allocation_context *ac, ++ struct ext3_free_extent *ex, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_free_extent *bex = &ac->ac_b_ex; ++ struct ext3_free_extent *gex = &ac->ac_g_ex; ++ ++ J_ASSERT(ex->fe_len > 0); ++ J_ASSERT(ex->fe_len < (1 << ac->ac_sb->s_blocksize_bits) * 8); ++ J_ASSERT(ex->fe_start < (1 << ac->ac_sb->s_blocksize_bits) * 8); ++ ++ ac->ac_found++; ++ ++ /* ++ * The special case - take what you catch first ++ */ ++ if (unlikely(ac->ac_flags & EXT3_MB_HINT_FIRST)) { ++ *bex = *ex; ++ ext3_mb_use_best_found(ac, e3b); ++ return; ++ } ++ ++ /* ++ * Let's check whether the chunk is good enough ++ */ ++ if (ex->fe_len == gex->fe_len) { ++ *bex = *ex; ++ ext3_mb_use_best_found(ac, e3b); ++ return; ++ } ++ ++ /* ++ * If this is the first found extent, just store it in the context ++ */ ++ if (bex->fe_len == 0) { ++ *bex = *ex; ++ return; ++ } ++ ++ /* ++ * If the new found extent is better, store it in the context ++ */ ++ if (bex->fe_len < gex->fe_len) { ++ /* if the request isn't satisfied, any found extent ++ * larger than the previous best one is better */ ++ if (ex->fe_len > bex->fe_len) ++ *bex = *ex; ++ } else if (ex->fe_len > gex->fe_len) { ++ /* if the request is satisfied, then we try to find ++ * an extent that still satisfies the request, but is ++ * smaller than the previous one */ ++ *bex = *ex; ++ } ++ ++ /* ++ * Let's scan at least a few extents and not pick the first one ++ */ ++ if (bex->fe_len > gex->fe_len && ac->ac_found > ext3_mb_min_to_scan) ++ ac->ac_status = AC_STATUS_BREAK; ++ ++ /* ++ * We don't want to scan for a whole year ++ */ ++ if (ac->ac_found > ext3_mb_max_to_scan) ++ ac->ac_status = AC_STATUS_BREAK; ++} ++ ++static int ext3_mb_try_best_found(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_free_extent ex = ac->ac_b_ex; ++ int group = ex.fe_group, max, err; ++ ++ J_ASSERT(ex.fe_len > 0); ++ err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); ++ if
(err) ++ return err; ++ ++ ext3_lock_group(ac->ac_sb, group); ++ max = mb_find_extent(e3b, 0, ex.fe_start, ex.fe_len, &ex); ++ ++ if (max > 0) { ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ ++ ext3_unlock_group(ac->ac_sb, group); ++ ++ ext3_mb_release_desc(e3b); ++ ++ return 0; ++} ++ ++static int ext3_mb_find_by_goal(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ int group = ac->ac_g_ex.fe_group, max, err; ++ struct ext3_free_extent ex; ++ ++ err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); ++ if (err) ++ return err; ++ ++ ext3_lock_group(ac->ac_sb, group); ++ max = mb_find_extent(e3b, 0, ac->ac_g_ex.fe_start, ++ ac->ac_g_ex.fe_len, &ex); ++ ++ if (max > 0) { ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); ++ J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ ext3_unlock_group(ac->ac_sb, group); ++ ++ ext3_mb_release_desc(e3b); ++ ++ return 0; ++} ++ ++/* ++ * The routine scans buddy structures (not bitmap!) 
from given order ++ * to max order and tries to find big enough chunk to satisfy the req ++ */ ++static void ext3_mb_simple_scan_group(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ struct ext3_group_info *grp = e3b->bd_info; ++ void *buddy; ++ int i, k, max; ++ ++ J_ASSERT(ac->ac_2order > 0); ++ for (i = ac->ac_2order; i < sb->s_blocksize_bits + 1; i++) { ++ if (grp->bb_counters[i] == 0) ++ continue; ++ ++ buddy = mb_find_buddy(e3b, i, &max); ++ if (buddy == NULL) { ++ printk(KERN_ALERT "looking for wrong order?\n"); ++ break; ++ } ++ ++ k = mb_find_next_zero_bit(buddy, max, 0); ++ J_ASSERT(k < max); ++ ++ ac->ac_found++; ++ ++ ac->ac_b_ex.fe_len = 1 << i; ++ ac->ac_b_ex.fe_start = k << i; ++ ac->ac_b_ex.fe_group = e3b->bd_group; ++ ++ ext3_mb_use_best_found(ac, e3b); ++ J_ASSERT(ac->ac_b_ex.fe_len == ac->ac_g_ex.fe_len); ++ ++ if (unlikely(ext3_mb_stats)) ++ atomic_inc(&EXT3_SB(sb)->s_bal_2orders); ++ ++ break; ++ } ++} ++ ++/* ++ * The routine scans the group and measures all found extents. ++ * In order to optimize scanning, caller must pass number of ++ * free blocks in the group, so the routine can know upper limit. 
++ */ ++static void ext3_mb_complex_scan_group(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ void *bitmap = EXT3_MB_BITMAP(e3b); ++ struct ext3_free_extent ex; ++ int i, free; ++ ++ free = e3b->bd_info->bb_free; ++ J_ASSERT(free > 0); ++ ++ i = e3b->bd_info->bb_first_free; ++ ++ while (free && ac->ac_status == AC_STATUS_CONTINUE) { ++ i = mb_find_next_zero_bit(bitmap, sb->s_blocksize * 8, i); ++ if (i >= sb->s_blocksize * 8) { ++ J_ASSERT(free == 0); ++ break; ++ } ++ ++ mb_find_extent(e3b, 0, i, ac->ac_g_ex.fe_len, &ex); ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(free >= ex.fe_len); ++ ++ ext3_mb_measure_extent(ac, &ex, e3b); ++ ++ i += ex.fe_len; ++ free -= ex.fe_len; ++ } ++} ++ ++static int ext3_mb_good_group(struct ext3_allocation_context *ac, ++ int group, int cr) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); ++ struct ext3_group_info *grp = sbi->s_group_info[group]; ++ unsigned free, fragments, i, bits; ++ ++ J_ASSERT(cr >= 0 && cr < 4); ++ J_ASSERT(!EXT3_MB_GRP_NEED_INIT(grp)); ++ ++ free = grp->bb_free; ++ fragments = grp->bb_fragments; ++ if (free == 0) ++ return 0; ++ if (fragments == 0) ++ return 0; ++ ++ switch (cr) { ++ case 0: ++ J_ASSERT(ac->ac_2order != 0); ++ bits = ac->ac_sb->s_blocksize_bits + 1; ++ for (i = ac->ac_2order; i < bits; i++) ++ if (grp->bb_counters[i] > 0) ++ return 1; ++ case 1: ++ if ((free / fragments) >= ac->ac_g_ex.fe_len) ++ return 1; ++ case 2: ++ if (free >= ac->ac_g_ex.fe_len) ++ return 1; ++ case 3: ++ return 1; ++ default: ++ BUG(); ++ } ++ ++ return 0; ++} ++ ++int ext3_mb_new_blocks(handle_t *handle, struct inode *inode, ++ unsigned long goal, int *len, int flags, int *errp) ++{ ++ struct buffer_head *bitmap_bh = NULL; ++ struct ext3_allocation_context ac; ++ int i, group, block, cr, err = 0; ++ struct ext3_group_desc *gdp; ++ struct ext3_super_block *es; ++ struct buffer_head *gdp_bh; ++ struct ext3_sb_info *sbi; ++ struct super_block *sb; ++ struct 
ext3_buddy e3b; ++ ++ J_ASSERT(len != NULL); ++ J_ASSERT(*len > 0); ++ ++ sb = inode->i_sb; ++ if (!sb) { ++ printk("ext3_mb_new_blocks: nonexistent device"); ++ return 0; ++ } ++ ++ if (!test_opt(sb, MBALLOC)) { ++ static int ext3_mballoc_warning = 0; ++ if (ext3_mballoc_warning == 0) { ++ printk(KERN_ERR "EXT3-fs: multiblock request with " ++ "mballoc disabled!\n"); ++ ext3_mballoc_warning++; ++ } ++ *len = 1; ++ err = ext3_new_block_old(handle, inode, goal, errp); ++ return err; ++ } ++ ++ ext3_mb_poll_new_transaction(sb, handle); ++ ++ sbi = EXT3_SB(sb); ++ es = EXT3_SB(sb)->s_es; ++ ++ /* ++ * We can't allocate > group size ++ */ ++ if (*len >= EXT3_BLOCKS_PER_GROUP(sb) - 10) ++ *len = EXT3_BLOCKS_PER_GROUP(sb) - 10; ++ ++ if (!(flags & EXT3_MB_HINT_RESERVED)) { ++ /* someone asks for non-reserved blocks */ ++ BUG_ON(*len > 1); ++ err = ext3_mb_reserve_blocks(sb, 1); ++ if (err) { ++ *errp = err; ++ return 0; ++ } ++ } ++ ++ /* ++ * Check quota for allocation of these blocks. ++ */ ++ while (*len && DQUOT_ALLOC_BLOCK(inode, *len)) ++ *len -= 1; ++ if (*len == 0) { ++ *errp = -EDQUOT; ++ block = 0; ++ goto out; ++ } ++ ++ /* start searching from the goal */ ++ if (goal < le32_to_cpu(es->s_first_data_block) || ++ goal >= le32_to_cpu(es->s_blocks_count)) ++ goal = le32_to_cpu(es->s_first_data_block); ++ group = (goal - le32_to_cpu(es->s_first_data_block)) / ++ EXT3_BLOCKS_PER_GROUP(sb); ++ block = ((goal - le32_to_cpu(es->s_first_data_block)) % ++ EXT3_BLOCKS_PER_GROUP(sb)); ++ ++ /* set up allocation goals */ ++ ac.ac_b_ex.fe_group = 0; ++ ac.ac_b_ex.fe_start = 0; ++ ac.ac_b_ex.fe_len = 0; ++ ac.ac_status = AC_STATUS_CONTINUE; ++ ac.ac_groups_scanned = 0; ++ ac.ac_ex_scanned = 0; ++ ac.ac_found = 0; ++ ac.ac_sb = inode->i_sb; ++ ac.ac_g_ex.fe_group = group; ++ ac.ac_g_ex.fe_start = block; ++ ac.ac_g_ex.fe_len = *len; ++ ac.ac_flags = flags; ++ ac.ac_2order = 0; ++ ac.ac_criteria = 0; ++ ++ /* probably, the request is for 2^8+ blocks (1/2/3/...
MB) */ ++ i = ffs(*len); ++ if (i >= 8) { ++ i--; ++ if ((*len & (~(1 << i))) == 0) ++ ac.ac_2order = i; ++ } ++ ++ /* Sometimes, caller may want to merge even small ++ * number of blocks to an existing extent */ ++ if (ac.ac_flags & EXT3_MB_HINT_MERGE) { ++ err = ext3_mb_find_by_goal(&ac, &e3b); ++ if (err) ++ goto out_err; ++ if (ac.ac_status == AC_STATUS_FOUND) ++ goto found; ++ } ++ ++ /* Let's just scan groups to find more-less suitable blocks */ ++ cr = ac.ac_2order ? 0 : 1; ++repeat: ++ for (; cr < 4 && ac.ac_status == AC_STATUS_CONTINUE; cr++) { ++ ac.ac_criteria = cr; ++ for (i = 0; i < EXT3_SB(sb)->s_groups_count; group++, i++) { ++ if (group == EXT3_SB(sb)->s_groups_count) ++ group = 0; ++ ++ if (EXT3_MB_GRP_NEED_INIT(sbi->s_group_info[group])) { ++ /* we need full data about the group ++ * to make a good selection */ ++ err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); ++ if (err) ++ goto out_err; ++ ext3_mb_release_desc(&e3b); ++ } ++ ++ /* check is group good for our criteries */ ++ if (!ext3_mb_good_group(&ac, group, cr)) ++ continue; ++ ++ err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); ++ if (err) ++ goto out_err; ++ ++ ext3_lock_group(sb, group); ++ if (!ext3_mb_good_group(&ac, group, cr)) { ++ /* someone did allocation from this group */ ++ ext3_unlock_group(sb, group); ++ ext3_mb_release_desc(&e3b); ++ continue; ++ } ++ ++ ac.ac_groups_scanned++; ++ if (cr == 0) ++ ext3_mb_simple_scan_group(&ac, &e3b); ++ else ++ ext3_mb_complex_scan_group(&ac, &e3b); ++ ++ ext3_unlock_group(sb, group); ++ ++ ext3_mb_release_desc(&e3b); ++ ++ if (err) ++ goto out_err; ++ if (ac.ac_status != AC_STATUS_CONTINUE) ++ break; ++ } ++ } ++ ++ if (ac.ac_b_ex.fe_len > 0 && ac.ac_status != AC_STATUS_FOUND && ++ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { ++ /* ++ * We've been searching too long. 
Let's try to allocate ++ * the best chunk we've found so far ++ */ ++ ++ /*if (ac.ac_found > ext3_mb_max_to_scan) ++ printk(KERN_ERR "EXT3-fs: too long searching at " ++ "%u (%d/%d)\n", cr, ac.ac_b_ex.fe_len, ++ ac.ac_g_ex.fe_len);*/ ++ ext3_mb_try_best_found(&ac, &e3b); ++ if (ac.ac_status != AC_STATUS_FOUND) { ++ /* ++ * Someone luckier has already allocated it. ++ * The only thing we can do is take the first ++ * found block(s) ++ */ ++ printk(KERN_ERR "EXT3-fs: and someone won our chunk\n"); ++ ac.ac_b_ex.fe_group = 0; ++ ac.ac_b_ex.fe_start = 0; ++ ac.ac_b_ex.fe_len = 0; ++ ac.ac_status = AC_STATUS_CONTINUE; ++ ac.ac_flags |= EXT3_MB_HINT_FIRST; ++ cr = 3; ++ goto repeat; ++ } ++ } ++ ++ if (ac.ac_status != AC_STATUS_FOUND) { ++ /* ++ * We definitely aren't lucky ++ */ ++ DQUOT_FREE_BLOCK(inode, *len); ++ *errp = -ENOSPC; ++ block = 0; ++#if 1 ++ printk(KERN_ERR "EXT3-fs: can't allocate: status %d, flags %d\n", ++ ac.ac_status, ac.ac_flags); ++ printk(KERN_ERR "EXT3-fs: goal %d, best found %d/%d/%d, cr %d\n", ++ ac.ac_g_ex.fe_len, ac.ac_b_ex.fe_group, ++ ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len, cr); ++ printk(KERN_ERR "EXT3-fs: %lu blocks reserved, %d found\n", ++ sbi->s_blocks_reserved, ac.ac_found); ++ printk("EXT3-fs: groups: "); ++ for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) ++ printk("%d: %d ", i, ++ sbi->s_group_info[i]->bb_free); ++ printk("\n"); ++#endif ++ goto out; ++ } ++ ++found: ++ J_ASSERT(ac.ac_b_ex.fe_len > 0); ++ ++ /* good news - free block(s) have been found.
now it's time ++ * to mark block(s) in good old journaled bitmap */ ++ block = ac.ac_b_ex.fe_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + ac.ac_b_ex.fe_start ++ + le32_to_cpu(es->s_first_data_block); ++ ++ /* we made a decision, now mark found blocks in good old ++ * bitmap to be journaled */ ++ ++ ext3_debug("using block group %d\n", ++ ac.ac_b_ex.fe_group); ++ ++ bitmap_bh = read_block_bitmap(sb, ac.ac_b_ex.fe_group); ++ if (!bitmap_bh) { ++ *errp = -EIO; ++ goto out_err; ++ } ++ ++ err = ext3_journal_get_write_access(handle, bitmap_bh); ++ if (err) { ++ *errp = err; ++ goto out_err; ++ } ++ ++ gdp = ext3_get_group_desc(sb, ac.ac_b_ex.fe_group, &gdp_bh); ++ if (!gdp) { ++ *errp = -EIO; ++ goto out_err; ++ } ++ ++ err = ext3_journal_get_write_access(handle, gdp_bh); ++ if (err) ++ goto out_err; ++ ++ block = ac.ac_b_ex.fe_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + ac.ac_b_ex.fe_start ++ + le32_to_cpu(es->s_first_data_block); ++ ++ if (block == le32_to_cpu(gdp->bg_block_bitmap) || ++ block == le32_to_cpu(gdp->bg_inode_bitmap) || ++ in_range(block, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group)) ++ ext3_error(sb, "ext3_new_block", ++ "Allocating block in system zone - " ++ "block = %u", block); ++#ifdef AGGRESSIVE_CHECK ++ for (i = 0; i < ac.ac_b_ex.fe_len; i++) ++ J_ASSERT(!mb_test_bit(ac.ac_b_ex.fe_start + i, bitmap_bh->b_data)); ++#endif ++ mb_set_bits(bitmap_bh->b_data, ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len); ++ ++ spin_lock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); ++ gdp->bg_free_blocks_count = ++ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) ++ - ac.ac_b_ex.fe_len); ++ spin_unlock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); ++ percpu_counter_mod(&sbi->s_freeblocks_counter, - ac.ac_b_ex.fe_len); ++ ++ err = ext3_journal_dirty_metadata(handle, bitmap_bh); ++ if (err) ++ goto out_err; ++ err = ext3_journal_dirty_metadata(handle, gdp_bh); ++ if (err) ++ goto out_err; ++ ++ sb->s_dirt = 1; ++ *errp = 0; ++ 
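The winning extent's absolute block number is derived from its (group, start) pair plus `s_first_data_block`, and earlier in `ext3_mb_new_blocks()` the goal block is split the opposite way into a group and an in-group offset. A small sketch of both directions (the geometry values are illustrative, not read from a real superblock):

```c
#include <assert.h>

/* Map (group, offset-in-group) to an absolute filesystem block,
 * mirroring the arithmetic ext3_mb_new_blocks() uses for ac_b_ex. */
static unsigned long fs_block(unsigned group, unsigned start,
                              unsigned long blocks_per_group,
                              unsigned long first_data_block)
{
    return (unsigned long)group * blocks_per_group + start + first_data_block;
}

/* Inverse mapping: split a goal block into group and in-group offset,
 * as done when setting up ac_g_ex. */
static void goal_to_group(unsigned long goal, unsigned long blocks_per_group,
                          unsigned long first_data_block,
                          unsigned *group, unsigned *start)
{
    *group = (unsigned)((goal - first_data_block) / blocks_per_group);
    *start = (unsigned)((goal - first_data_block) % blocks_per_group);
}
```

With 32768 blocks per group and first data block 1, group 2 offset 100 maps to block 65637, and splitting 65637 recovers the same pair.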
brelse(bitmap_bh); ++ ++ /* drop non-allocated, but dquote'd blocks */ ++ J_ASSERT(*len >= ac.ac_b_ex.fe_len); ++ DQUOT_FREE_BLOCK(inode, *len - ac.ac_b_ex.fe_len); ++ ++ *len = ac.ac_b_ex.fe_len; ++ J_ASSERT(*len > 0); ++ J_ASSERT(block != 0); ++ goto out; ++ ++out_err: ++ /* if we've already allocated something, roll it back */ ++ if (ac.ac_status == AC_STATUS_FOUND) { ++ /* FIXME: free blocks here */ ++ } ++ ++ DQUOT_FREE_BLOCK(inode, *len); ++ brelse(bitmap_bh); ++ *errp = err; ++ block = 0; ++out: ++ if (!(flags & EXT3_MB_HINT_RESERVED)) { ++ /* block wasn't reserved before and we reserved it ++ * at the beginning of allocation. it doesn't matter ++ * whether we allocated anything or we failed: time ++ * to release reservation. NOTE: because I expect ++ * any multiblock request from delayed allocation ++ * path only, here is single block always */ ++ ext3_mb_release_blocks(sb, 1); ++ } ++ ++ if (unlikely(ext3_mb_stats) && ac.ac_g_ex.fe_len > 1) { ++ atomic_inc(&sbi->s_bal_reqs); ++ atomic_add(*len, &sbi->s_bal_allocated); ++ if (*len >= ac.ac_g_ex.fe_len) ++ atomic_inc(&sbi->s_bal_success); ++ atomic_add(ac.ac_found, &sbi->s_bal_ex_scanned); ++ if (ac.ac_g_ex.fe_start == ac.ac_b_ex.fe_start && ++ ac.ac_g_ex.fe_group == ac.ac_b_ex.fe_group) ++ atomic_inc(&sbi->s_bal_goals); ++ if (ac.ac_found > ext3_mb_max_to_scan) ++ atomic_inc(&sbi->s_bal_breaks); ++ } ++ ++ ext3_mb_store_history(sb, &ac); ++ ++ return block; ++} ++EXPORT_SYMBOL(ext3_mb_new_blocks); ++ ++#ifdef EXT3_MB_HISTORY ++struct ext3_mb_proc_session { ++ struct ext3_mb_history *history; ++ struct super_block *sb; ++ int start; ++ int max; ++}; ++ ++static void *ext3_mb_history_skip_empty(struct ext3_mb_proc_session *s, ++ struct ext3_mb_history *hs, ++ int first) ++{ ++ if (hs == s->history + s->max) ++ hs = s->history; ++ if (!first && hs == s->history + s->start) ++ return NULL; ++ while (hs->goal.fe_len == 0) { ++ hs++; ++ if (hs == s->history + s->max) ++ hs = s->history; ++ if (hs == s->history + 
s->start) ++ return NULL; ++ } ++ return hs; ++} ++ ++static void *ext3_mb_seq_history_start(struct seq_file *seq, loff_t *pos) ++{ ++ struct ext3_mb_proc_session *s = seq->private; ++ struct ext3_mb_history *hs; ++ int l = *pos; ++ ++ if (l == 0) ++ return SEQ_START_TOKEN; ++ hs = ext3_mb_history_skip_empty(s, s->history + s->start, 1); ++ if (!hs) ++ return NULL; ++ while (--l && (hs = ext3_mb_history_skip_empty(s, ++hs, 0)) != NULL); ++ return hs; ++} ++ ++static void *ext3_mb_seq_history_next(struct seq_file *seq, void *v, loff_t *pos) ++{ ++ struct ext3_mb_proc_session *s = seq->private; ++ struct ext3_mb_history *hs = v; ++ ++ ++*pos; ++ if (v == SEQ_START_TOKEN) ++ return ext3_mb_history_skip_empty(s, s->history + s->start, 1); ++ else ++ return ext3_mb_history_skip_empty(s, ++hs, 0); ++} ++ ++static int ext3_mb_seq_history_show(struct seq_file *seq, void *v) ++{ ++ struct ext3_mb_history *hs = v; ++ char buf[20], buf2[20]; ++ ++ if (v == SEQ_START_TOKEN) { ++ seq_printf(seq, "%-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", ++ "goal", "result", "found", "grps", "cr", "merge", ++ "tail", "broken"); ++ return 0; ++ } ++ ++ sprintf(buf, "%u/%u/%u", hs->goal.fe_group, ++ hs->goal.fe_start, hs->goal.fe_len); ++ sprintf(buf2, "%u/%u/%u", hs->result.fe_group, ++ hs->result.fe_start, hs->result.fe_len); ++ seq_printf(seq, "%-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", buf, ++ buf2, hs->found, hs->groups, hs->cr, ++ hs->merged ? "M" : "", hs->tail, ++ hs->buddy ? 
1 << hs->buddy : 0); ++ return 0; ++} ++ ++static void ext3_mb_seq_history_stop(struct seq_file *seq, void *v) ++{ ++} ++ ++static struct seq_operations ext3_mb_seq_history_ops = { ++ .start = ext3_mb_seq_history_start, ++ .next = ext3_mb_seq_history_next, ++ .stop = ext3_mb_seq_history_stop, ++ .show = ext3_mb_seq_history_show, ++}; ++ ++static int ext3_mb_seq_history_open(struct inode *inode, struct file *file) ++{ ++ struct super_block *sb = PDE(inode)->data; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_mb_proc_session *s; ++ int rc, size; ++ ++ s = kmalloc(sizeof(*s), GFP_KERNEL); ++ if (s == NULL) ++ return -ENOMEM; ++ size = sizeof(struct ext3_mb_history) * sbi->s_mb_history_max; ++ s->history = kmalloc(size, GFP_KERNEL); ++ if (s->history == NULL) { ++ kfree(s); ++ return -ENOMEM; ++ } ++ ++ spin_lock(&sbi->s_mb_history_lock); ++ memcpy(s->history, sbi->s_mb_history, size); ++ s->max = sbi->s_mb_history_max; ++ s->start = sbi->s_mb_history_cur % s->max; ++ spin_unlock(&sbi->s_mb_history_lock); ++ ++ rc = seq_open(file, &ext3_mb_seq_history_ops); ++ if (rc == 0) { ++ struct seq_file *m = (struct seq_file *)file->private_data; ++ m->private = s; ++ } else { ++ kfree(s->history); ++ kfree(s); ++ } ++ return rc; ++} ++ ++static int ext3_mb_seq_history_release(struct inode *inode, struct file *file) ++{ ++ struct seq_file *seq = (struct seq_file *)file->private_data; ++ struct ext3_mb_proc_session *s = seq->private; ++ kfree(s->history); ++ kfree(s); ++ return seq_release(inode, file); ++} ++ ++static struct file_operations ext3_mb_seq_history_fops = { ++ .owner = THIS_MODULE, ++ .open = ext3_mb_seq_history_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = ext3_mb_seq_history_release, ++}; ++ ++static void ext3_mb_history_release(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ char name[64]; ++ ++ snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ remove_proc_entry("mb_history", sbi->s_mb_proc); ++ 
remove_proc_entry(name, proc_root_ext3); ++ ++ if (sbi->s_mb_history) ++ kfree(sbi->s_mb_history); ++} ++ ++static void ext3_mb_history_init(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ char name[64]; ++ int i; ++ ++ snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ sbi->s_mb_proc = proc_mkdir(name, proc_root_ext3); ++ if (sbi->s_mb_proc != NULL) { ++ struct proc_dir_entry *p; ++ p = create_proc_entry("mb_history", S_IRUGO, sbi->s_mb_proc); ++ if (p) { ++ p->proc_fops = &ext3_mb_seq_history_fops; ++ p->data = sb; ++ } ++ } ++ ++ sbi->s_mb_history_max = 1000; ++ sbi->s_mb_history_cur = 0; ++ spin_lock_init(&sbi->s_mb_history_lock); ++ i = sbi->s_mb_history_max * sizeof(struct ext3_mb_history); ++ sbi->s_mb_history = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_history != NULL) memset(sbi->s_mb_history, 0, i); ++ /* if we can't allocate the history buffer, we simply won't use it */ ++} ++ ++static void ++ext3_mb_store_history(struct super_block *sb, struct ext3_allocation_context *ac) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_mb_history h; ++ ++ if (unlikely(sbi->s_mb_history == NULL)) ++ return; ++ ++ h.goal = ac->ac_g_ex; ++ h.result = ac->ac_b_ex; ++ h.found = ac->ac_found; ++ h.cr = ac->ac_criteria; ++ h.groups = ac->ac_groups_scanned; ++ h.tail = ac->ac_tail; ++ h.buddy = ac->ac_buddy; ++ h.merged = 0; ++ if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start && ++ ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group) ++ h.merged = 1; ++ ++ spin_lock(&sbi->s_mb_history_lock); ++ memcpy(sbi->s_mb_history + sbi->s_mb_history_cur, &h, sizeof(h)); ++ if (++sbi->s_mb_history_cur >= sbi->s_mb_history_max) ++ sbi->s_mb_history_cur = 0; ++ spin_unlock(&sbi->s_mb_history_lock); ++} ++ ++#else ++#define ext3_mb_history_release(sb) ++#define ext3_mb_history_init(sb) ++#endif ++ ++int ext3_mb_init_backend(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int i, len; ++ ++ len = sizeof(struct ext3_buddy_group_blocks *) * 
sbi->s_groups_count; ++ sbi->s_group_info = kmalloc(len, GFP_KERNEL); ++ if (sbi->s_group_info == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ return -ENOMEM; ++ } ++ memset(sbi->s_group_info, 0, len); ++ ++ sbi->s_buddy_cache = new_inode(sb); ++ if (sbi->s_buddy_cache == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't get new inode\n"); ++ kfree(sbi->s_group_info); ++ return -ENOMEM; ++ } ++ ++ /* ++ * calculate the needed size. if the bb_counters size ++ * changes, don't forget about ext3_mb_generate_buddy() ++ */ ++ len = sizeof(struct ext3_group_info); ++ len += sizeof(unsigned short) * (sb->s_blocksize_bits + 2); ++ for (i = 0; i < sbi->s_groups_count; i++) { ++ struct ext3_group_desc * desc; ++ ++ sbi->s_group_info[i] = kmalloc(len, GFP_KERNEL); ++ if (sbi->s_group_info[i] == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ goto err_out; ++ } ++ desc = ext3_get_group_desc(sb, i, NULL); ++ if (desc == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't read descriptor %u\n", i); ++ goto err_out; ++ } ++ memset(sbi->s_group_info[i], 0, len); ++ set_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, ++ &sbi->s_group_info[i]->bb_state); ++ sbi->s_group_info[i]->bb_free = ++ le16_to_cpu(desc->bg_free_blocks_count); ++ } ++ ++ return 0; ++ ++err_out: ++ while (i >= 0) ++ kfree(sbi->s_group_info[i--]); ++ kfree(sbi->s_group_info); ++ iput(sbi->s_buddy_cache); ++ return -ENOMEM; ++} ++ ++int ext3_mb_init(struct super_block *sb, int needs_recovery) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct inode *root = sb->s_root->d_inode; ++ unsigned i, offset, max; ++ struct dentry *dentry; ++ ++ if (!test_opt(sb, MBALLOC)) ++ return 0; ++ ++ i = (sb->s_blocksize_bits + 2) * sizeof(unsigned short); ++ ++ sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_offsets == NULL) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ return -ENOMEM; ++ } ++ sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_maxs == NULL) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ 
kfree(sbi->s_mb_offsets); ++ return -ENOMEM; ++ } ++ ++ /* order 0 is regular bitmap */ ++ sbi->s_mb_maxs[0] = sb->s_blocksize << 3; ++ sbi->s_mb_offsets[0] = 0; ++ ++ i = 1; ++ offset = 0; ++ max = sb->s_blocksize << 2; ++ do { ++ sbi->s_mb_offsets[i] = offset; ++ sbi->s_mb_maxs[i] = max; ++ offset += 1 << (sb->s_blocksize_bits - i); ++ max = max >> 1; ++ i++; ++ } while (i <= sb->s_blocksize_bits + 1); ++ ++ ++ /* init file for buddy data */ ++ if ((i = ext3_mb_init_backend(sb))) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ kfree(sbi->s_mb_offsets); ++ kfree(sbi->s_mb_maxs); ++ return i; ++ } ++ ++ spin_lock_init(&sbi->s_reserve_lock); ++ spin_lock_init(&sbi->s_md_lock); ++ INIT_LIST_HEAD(&sbi->s_active_transaction); ++ INIT_LIST_HEAD(&sbi->s_closed_transaction); ++ INIT_LIST_HEAD(&sbi->s_committed_transaction); ++ spin_lock_init(&sbi->s_bal_lock); ++ ++ /* remove old on-disk buddy file */ ++ mutex_lock(&root->i_mutex); ++ dentry = lookup_one_len(".buddy", sb->s_root, strlen(".buddy")); ++ if (dentry->d_inode != NULL) { ++ i = vfs_unlink(root, dentry); ++ if (i != 0) ++ printk(KERN_ERR "EXT3-fs: can't remove .buddy file: %d\n", i); ++ } ++ dput(dentry); ++ mutex_unlock(&root->i_mutex); ++ ++ ext3_mb_history_init(sb); ++ ++ printk(KERN_INFO "EXT3-fs: mballoc enabled\n"); ++ return 0; ++} ++ ++int ext3_mb_release(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int i; ++ ++ if (!test_opt(sb, MBALLOC)) ++ return 0; ++ ++ /* release freed, non-committed blocks */ ++ spin_lock(&sbi->s_md_lock); ++ list_splice_init(&sbi->s_closed_transaction, ++ &sbi->s_committed_transaction); ++ list_splice_init(&sbi->s_active_transaction, ++ &sbi->s_committed_transaction); ++ spin_unlock(&sbi->s_md_lock); ++ ext3_mb_free_committed_blocks(sb); ++ ++ if (sbi->s_group_info) { ++ for (i = 0; i < sbi->s_groups_count; i++) { ++ if (sbi->s_group_info[i] == NULL) ++ continue; ++ kfree(sbi->s_group_info[i]); ++ } ++ kfree(sbi->s_group_info); ++ } ++ if (sbi->s_mb_offsets) ++ 
kfree(sbi->s_mb_offsets); ++ if (sbi->s_mb_maxs) ++ kfree(sbi->s_mb_maxs); ++ if (sbi->s_buddy_cache) ++ iput(sbi->s_buddy_cache); ++ if (sbi->s_blocks_reserved) ++ printk("ext3-fs: %ld blocks still reserved at umount!\n", ++ sbi->s_blocks_reserved); ++ if (ext3_mb_stats) { ++ printk("EXT3-fs: mballoc: %u blocks %u reqs (%u success)\n", ++ atomic_read(&sbi->s_bal_allocated), ++ atomic_read(&sbi->s_bal_reqs), ++ atomic_read(&sbi->s_bal_success)); ++ printk("EXT3-fs: mballoc: %u extents scanned, %u goal hits, " ++ "%u 2^N hits, %u breaks\n", ++ atomic_read(&sbi->s_bal_ex_scanned), ++ atomic_read(&sbi->s_bal_goals), ++ atomic_read(&sbi->s_bal_2orders), ++ atomic_read(&sbi->s_bal_breaks)); ++ printk("EXT3-fs: mballoc: %lu buddies generated, generation time %Lu\n", ++ sbi->s_mb_buddies_generated, ++ sbi->s_mb_generation_time); ++ } ++ ++ ext3_mb_history_release(sb); ++ ++ return 0; ++} ++ ++void ext3_mb_free_committed_blocks(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int err, i, count = 0, count2 = 0; ++ struct ext3_free_metadata *md; ++ struct ext3_buddy e3b; ++ ++ if (list_empty(&sbi->s_committed_transaction)) ++ return; ++ ++ /* there are committed blocks to be freed */ ++ do { ++ /* get next array of blocks */ ++ md = NULL; ++ spin_lock(&sbi->s_md_lock); ++ if (!list_empty(&sbi->s_committed_transaction)) { ++ md = list_entry(sbi->s_committed_transaction.next, ++ struct ext3_free_metadata, list); ++ list_del(&md->list); ++ } ++ spin_unlock(&sbi->s_md_lock); ++ ++ if (md == NULL) ++ break; ++ ++ mb_debug("gonna free %u blocks in group %u (0x%p):", ++ md->num, md->group, md); ++ ++ err = ext3_mb_load_buddy(sb, md->group, &e3b); ++ BUG_ON(err != 0); ++ ++ /* there are blocks to put in buddy to make them really free */ ++ count += md->num; ++ count2++; ++ ext3_lock_group(sb, md->group); ++ for (i = 0; i < md->num; i++) { ++ mb_debug(" %u", md->blocks[i]); ++ mb_free_blocks(&e3b, md->blocks[i], 1); ++ } ++ mb_debug("\n"); ++ ext3_unlock_group(sb, 
md->group); ++ ++ /* balance refcounts from ext3_mb_free_metadata() */ ++ page_cache_release(e3b.bd_buddy_page); ++ page_cache_release(e3b.bd_bitmap_page); ++ ++ kfree(md); ++ ext3_mb_release_desc(&e3b); ++ ++ } while (md); ++ mb_debug("freed %u blocks in %u structures\n", count, count2); ++} ++ ++void ext3_mb_poll_new_transaction(struct super_block *sb, handle_t *handle) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ ++ if (sbi->s_last_transaction == handle->h_transaction->t_tid) ++ return; ++ ++ /* a new transaction has started: close the last one and free blocks ++ * of the committed transaction. only one transaction can be active ++ * at a time, so the previous transaction may still be committing, ++ * while the transaction before the previous one is known to be fully ++ * logged. this means we may now free blocks that were freed in all ++ * transactions before the previous one. */ ++ ++ spin_lock(&sbi->s_md_lock); ++ if (sbi->s_last_transaction != handle->h_transaction->t_tid) { ++ mb_debug("new transaction %lu, old %lu\n", ++ (unsigned long) handle->h_transaction->t_tid, ++ (unsigned long) sbi->s_last_transaction); ++ list_splice_init(&sbi->s_closed_transaction, ++ &sbi->s_committed_transaction); ++ list_splice_init(&sbi->s_active_transaction, ++ &sbi->s_closed_transaction); ++ sbi->s_last_transaction = handle->h_transaction->t_tid; ++ } ++ spin_unlock(&sbi->s_md_lock); ++ ++ ext3_mb_free_committed_blocks(sb); ++} ++ ++int ext3_mb_free_metadata(handle_t *handle, struct ext3_buddy *e3b, ++ int group, int block, int count) ++{ ++ struct ext3_group_info *db = e3b->bd_info; ++ struct super_block *sb = e3b->bd_sb; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_free_metadata *md; ++ int i; ++ ++ J_ASSERT(e3b->bd_bitmap_page != NULL); ++ J_ASSERT(e3b->bd_buddy_page != NULL); ++ ++ ext3_lock_group(sb, group); ++ for (i = 0; i < count; i++) { ++ md = db->bb_md_cur; ++ if (md && db->bb_tid != handle->h_transaction->t_tid) { ++ db->bb_md_cur = NULL; ++ md = 
NULL; ++ } ++ ++ if (md == NULL) { ++ ext3_unlock_group(sb, group); ++ md = kmalloc(sizeof(*md), GFP_KERNEL); ++ if (md == NULL) ++ return -ENOMEM; ++ md->num = 0; ++ md->group = group; ++ ++ ext3_lock_group(sb, group); ++ if (db->bb_md_cur == NULL) { ++ spin_lock(&sbi->s_md_lock); ++ list_add(&md->list, &sbi->s_active_transaction); ++ spin_unlock(&sbi->s_md_lock); ++ /* protect buddy cache from being freed, ++ * otherwise we'll refresh it from ++ * on-disk bitmap and lose not-yet-available ++ * blocks */ ++ page_cache_get(e3b->bd_buddy_page); ++ page_cache_get(e3b->bd_bitmap_page); ++ db->bb_md_cur = md; ++ db->bb_tid = handle->h_transaction->t_tid; ++ mb_debug("new md 0x%p for group %u\n", ++ md, md->group); ++ } else { ++ kfree(md); ++ md = db->bb_md_cur; ++ } ++ } ++ ++ BUG_ON(md->num >= EXT3_BB_MAX_BLOCKS); ++ md->blocks[md->num] = block + i; ++ md->num++; ++ if (md->num == EXT3_BB_MAX_BLOCKS) { ++ /* no more space, put full container on a sb's list */ ++ db->bb_md_cur = NULL; ++ } ++ } ++ ext3_unlock_group(sb, group); ++ return 0; ++} ++ ++void ext3_mb_free_blocks(handle_t *handle, struct inode *inode, ++ unsigned long block, unsigned long count, ++ int metadata, int *freed) ++{ ++ struct buffer_head *bitmap_bh = NULL; ++ struct ext3_group_desc *gdp; ++ struct ext3_super_block *es; ++ unsigned long bit, overflow; ++ struct buffer_head *gd_bh; ++ unsigned long block_group; ++ struct ext3_sb_info *sbi; ++ struct super_block *sb; ++ struct ext3_buddy e3b; ++ int err = 0, ret; ++ ++ *freed = 0; ++ sb = inode->i_sb; ++ if (!sb) { ++ printk ("ext3_free_blocks: nonexistent device"); ++ return; ++ } ++ ++ ext3_mb_poll_new_transaction(sb, handle); ++ ++ sbi = EXT3_SB(sb); ++ es = EXT3_SB(sb)->s_es; ++ if (block < le32_to_cpu(es->s_first_data_block) || ++ block + count < block || ++ block + count > le32_to_cpu(es->s_blocks_count)) { ++ ext3_error (sb, "ext3_free_blocks", ++ "Freeing blocks not in datazone - " ++ "block = %lu, count = %lu", block, count); ++ goto 
error_return; ++ } ++ ++ ext3_debug("freeing block %lu\n", block); ++ ++do_more: ++ overflow = 0; ++ block_group = (block - le32_to_cpu(es->s_first_data_block)) / ++ EXT3_BLOCKS_PER_GROUP(sb); ++ bit = (block - le32_to_cpu(es->s_first_data_block)) % ++ EXT3_BLOCKS_PER_GROUP(sb); ++ /* ++ * Check to see if we are freeing blocks across a group ++ * boundary. ++ */ ++ if (bit + count > EXT3_BLOCKS_PER_GROUP(sb)) { ++ overflow = bit + count - EXT3_BLOCKS_PER_GROUP(sb); ++ count -= overflow; ++ } ++ brelse(bitmap_bh); ++ bitmap_bh = read_block_bitmap(sb, block_group); ++ if (!bitmap_bh) ++ goto error_return; ++ gdp = ext3_get_group_desc (sb, block_group, &gd_bh); ++ if (!gdp) ++ goto error_return; ++ ++ if (in_range (le32_to_cpu(gdp->bg_block_bitmap), block, count) || ++ in_range (le32_to_cpu(gdp->bg_inode_bitmap), block, count) || ++ in_range (block, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group) || ++ in_range (block + count - 1, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group)) ++ ext3_error (sb, "ext3_free_blocks", ++ "Freeing blocks in system zones - " ++ "Block = %lu, count = %lu", ++ block, count); ++ ++ BUFFER_TRACE(bitmap_bh, "getting write access"); ++ err = ext3_journal_get_write_access(handle, bitmap_bh); ++ if (err) ++ goto error_return; ++ ++ /* ++ * We are about to modify some metadata. 
Call the journal APIs ++ * to unshare ->b_data if a currently-committing transaction is ++ * using it ++ */ ++ BUFFER_TRACE(gd_bh, "get_write_access"); ++ err = ext3_journal_get_write_access(handle, gd_bh); ++ if (err) ++ goto error_return; ++ ++ err = ext3_mb_load_buddy(sb, block_group, &e3b); ++ if (err) ++ goto error_return; ++ ++#ifdef AGGRESSIVE_CHECK ++ { ++ int i; ++ for (i = 0; i < count; i++) ++ J_ASSERT(mb_test_bit(bit + i, bitmap_bh->b_data)); ++ } ++#endif ++ mb_clear_bits(bitmap_bh->b_data, bit, count); ++ ++ /* We dirtied the bitmap block */ ++ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); ++ err = ext3_journal_dirty_metadata(handle, bitmap_bh); ++ ++ if (metadata) { ++ /* blocks being freed are metadata. these blocks shouldn't ++ * be used until this transaction is committed */ ++ ext3_mb_free_metadata(handle, &e3b, block_group, bit, count); ++ } else { ++ ext3_lock_group(sb, block_group); ++ mb_free_blocks(&e3b, bit, count); ++ ext3_unlock_group(sb, block_group); ++ } ++ ++ spin_lock(sb_bgl_lock(sbi, block_group)); ++ gdp->bg_free_blocks_count = ++ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count); ++ spin_unlock(sb_bgl_lock(sbi, block_group)); ++ percpu_counter_mod(&sbi->s_freeblocks_counter, count); ++ ++ ext3_mb_release_desc(&e3b); ++ ++ *freed = count; ++ ++ /* And the group descriptor block */ ++ BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); ++ ret = ext3_journal_dirty_metadata(handle, gd_bh); ++ if (!err) err = ret; ++ ++ if (overflow && !err) { ++ block += count; ++ count = overflow; ++ goto do_more; ++ } ++ sb->s_dirt = 1; ++error_return: ++ brelse(bitmap_bh); ++ ext3_std_error(sb, err); ++ return; ++} ++ ++int ext3_mb_reserve_blocks(struct super_block *sb, int blocks) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int free, ret = -ENOSPC; ++ ++ BUG_ON(blocks < 0); ++ spin_lock(&sbi->s_reserve_lock); ++ free = percpu_counter_read_positive(&sbi->s_freeblocks_counter); ++ if (blocks <= free - sbi->s_blocks_reserved) { 
++ sbi->s_blocks_reserved += blocks; ++ ret = 0; ++ } ++ spin_unlock(&sbi->s_reserve_lock); ++ return ret; ++} ++ ++void ext3_mb_release_blocks(struct super_block *sb, int blocks) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ ++ BUG_ON(blocks < 0); ++ spin_lock(&sbi->s_reserve_lock); ++ sbi->s_blocks_reserved -= blocks; ++ WARN_ON(sbi->s_blocks_reserved < 0); ++ if (sbi->s_blocks_reserved < 0) ++ sbi->s_blocks_reserved = 0; ++ spin_unlock(&sbi->s_reserve_lock); ++} ++ ++int ext3_new_block(handle_t *handle, struct inode *inode, ++ unsigned long goal, int *errp) ++{ ++ int ret, len; ++ ++ if (!test_opt(inode->i_sb, MBALLOC)) { ++ ret = ext3_new_block_old(handle, inode, goal, errp); ++ goto out; ++ } ++ len = 1; ++ ret = ext3_mb_new_blocks(handle, inode, goal, &len, 0, errp); ++out: ++ return ret; ++} ++ ++ ++void ext3_free_blocks(handle_t *handle, struct inode * inode, ++ unsigned long block, unsigned long count, int metadata) ++{ ++ struct super_block *sb; ++ int freed; ++ ++ sb = inode->i_sb; ++ if (!test_opt(sb, MBALLOC)) ++ ext3_free_blocks_sb(handle, sb, block, count, &freed); ++ else ++ ext3_mb_free_blocks(handle, inode, block, count, metadata, &freed); ++ if (freed) ++ DQUOT_FREE_BLOCK(inode, freed); ++ return; ++} ++ ++#define EXT3_ROOT "ext3" ++#define EXT3_MB_STATS_NAME "mb_stats" ++#define EXT3_MB_MAX_TO_SCAN_NAME "mb_max_to_scan" ++#define EXT3_MB_MIN_TO_SCAN_NAME "mb_min_to_scan" ++ ++static int ext3_mb_stats_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_stats); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_stats_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_STATS_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if 
(copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Only set to 0 or 1 respectively; zero->0; non-zero->1 */ ++ ext3_mb_stats = (simple_strtol(str, NULL, 0) != 0); ++ return count; ++} ++ ++static int ext3_mb_max_to_scan_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_max_to_scan); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_max_to_scan_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_MAX_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Value must be a positive integer */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_max_to_scan = value; ++ ++ return count; ++} ++ ++static int ext3_mb_min_to_scan_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_min_to_scan); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_min_to_scan_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_MIN_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Value must be a positive integer */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_min_to_scan = value; ++ ++ return count; ++} ++ ++int __init init_ext3_proc(void) ++{ ++ struct proc_dir_entry 
*proc_ext3_mb_stats; ++ struct proc_dir_entry *proc_ext3_mb_max_to_scan; ++ struct proc_dir_entry *proc_ext3_mb_min_to_scan; ++ ++ proc_root_ext3 = proc_mkdir(EXT3_ROOT, proc_root_fs); ++ if (proc_root_ext3 == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", EXT3_ROOT); ++ return -EIO; ++ } ++ ++ /* Initialize EXT3_MB_STATS_NAME */ ++ proc_ext3_mb_stats = create_proc_entry(EXT3_MB_STATS_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_stats == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_STATS_NAME); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_stats->data = NULL; ++ proc_ext3_mb_stats->read_proc = ext3_mb_stats_read; ++ proc_ext3_mb_stats->write_proc = ext3_mb_stats_write; ++ ++ /* Initialize EXT3_MAX_TO_SCAN_NAME */ ++ proc_ext3_mb_max_to_scan = create_proc_entry( ++ EXT3_MB_MAX_TO_SCAN_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_max_to_scan == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_MAX_TO_SCAN_NAME); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_max_to_scan->data = NULL; ++ proc_ext3_mb_max_to_scan->read_proc = ext3_mb_max_to_scan_read; ++ proc_ext3_mb_max_to_scan->write_proc = ext3_mb_max_to_scan_write; ++ ++ /* Initialize EXT3_MIN_TO_SCAN_NAME */ ++ proc_ext3_mb_min_to_scan = create_proc_entry( ++ EXT3_MB_MIN_TO_SCAN_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_min_to_scan == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_MIN_TO_SCAN_NAME); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_min_to_scan->data = NULL; ++ proc_ext3_mb_min_to_scan->read_proc = ext3_mb_min_to_scan_read; ++ 
proc_ext3_mb_min_to_scan->write_proc = ext3_mb_min_to_scan_write; ++ ++ return 0; ++} ++ ++void exit_ext3_proc(void) ++{ ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++} ++ +Index: linux-2.6.16.i686/fs/ext3/extents.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/extents.c 2006-05-30 22:55:32.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/extents.c 2006-05-30 23:02:59.000000000 +0800 +@@ -771,7 +771,7 @@ + for (i = 0; i < depth; i++) { + if (!ablocks[i]) + continue; +- ext3_free_blocks(handle, tree->inode, ablocks[i], 1); ++ ext3_free_blocks(handle, tree->inode, ablocks[i], 1, 1); + } + } + kfree(ablocks); +@@ -1428,7 +1428,7 @@ + path->p_idx->ei_leaf); + bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); + ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); +- ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1); ++ ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1, 1); + return err; + } + +@@ -1913,10 +1913,12 @@ + int needed = ext3_remove_blocks_credits(tree, ex, from, to); + handle_t *handle = ext3_journal_start(tree->inode, needed); + struct buffer_head *bh; +- int i; ++ int i, metadata = 0; + + if (IS_ERR(handle)) + return PTR_ERR(handle); ++ if (S_ISDIR(tree->inode->i_mode) || S_ISLNK(tree->inode->i_mode)) ++ metadata = 1; + if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { + /* tail removal */ + unsigned long num, start; +@@ -1928,7 +1930,7 @@ + bh = sb_find_get_block(tree->inode->i_sb, start + i); + ext3_forget(handle, 0, tree->inode, bh, start + i); + } +- ext3_free_blocks(handle, tree->inode, start, num); ++ ext3_free_blocks(handle, tree->inode, start, num, metadata); + } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { + 
printk("strange request: removal %lu-%lu from %u:%u\n", + from, to, ex->ee_block, ex->ee_len); +Index: linux-2.6.16.i686/fs/ext3/xattr.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/xattr.c 2006-03-20 13:53:29.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/xattr.c 2006-05-30 23:02:59.000000000 +0800 +@@ -484,7 +484,7 @@ + ea_bdebug(bh, "refcount now=0; freeing"); + if (ce) + mb_cache_entry_free(ce); +- ext3_free_blocks(handle, inode, bh->b_blocknr, 1); ++ ext3_free_blocks(handle, inode, bh->b_blocknr, 1, 1); + get_bh(bh); + ext3_forget(handle, 1, inode, bh, bh->b_blocknr); + } else { +@@ -804,7 +804,7 @@ + new_bh = sb_getblk(sb, block); + if (!new_bh) { + getblk_failed: +- ext3_free_blocks(handle, inode, block, 1); ++ ext3_free_blocks(handle, inode, block, 1, 1); + error = -EIO; + goto cleanup; + } +Index: linux-2.6.16.i686/fs/ext3/balloc.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/balloc.c 2006-03-20 13:53:29.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/balloc.c 2006-05-30 23:02:59.000000000 +0800 +@@ -80,7 +80,7 @@ + * + * Return buffer_head on success or NULL in case of failure. 
+ */ +-static struct buffer_head * ++struct buffer_head * + read_block_bitmap(struct super_block *sb, unsigned int block_group) + { + struct ext3_group_desc * desc; +@@ -491,24 +491,6 @@ + return; + } + +-/* Free given blocks, update quota and i_blocks field */ +-void ext3_free_blocks(handle_t *handle, struct inode *inode, +- unsigned long block, unsigned long count) +-{ +- struct super_block * sb; +- int dquot_freed_blocks; +- +- sb = inode->i_sb; +- if (!sb) { +- printk ("ext3_free_blocks: nonexistent device"); +- return; +- } +- ext3_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks); +- if (dquot_freed_blocks) +- DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); +- return; +-} +- + /* + * For ext3 allocations, we must not reuse any blocks which are + * allocated in the bitmap buffer's "last committed data" copy. This +@@ -1154,7 +1136,7 @@ + * bitmap, and then for any free bit if that fails. + * This function also updates quota and i_blocks field. + */ +-int ext3_new_block(handle_t *handle, struct inode *inode, ++int ext3_new_block_old(handle_t *handle, struct inode *inode, + unsigned long goal, int *errp) + { + struct buffer_head *bitmap_bh = NULL; +Index: linux-2.6.16.i686/fs/ext3/super.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/super.c 2006-05-30 22:55:32.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/super.c 2006-05-30 23:02:59.000000000 +0800 +@@ -392,6 +392,7 @@ + struct ext3_super_block *es = sbi->s_es; + int i; + ++ ext3_mb_release(sb); + ext3_ext_release(sb); + ext3_xattr_put_super(sb); + journal_destroy(sbi->s_journal); +@@ -640,7 +641,7 @@ + Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, + Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, + Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, +- Opt_extents, Opt_extdebug, ++ Opt_extents, Opt_extdebug, Opt_mballoc, + Opt_grpquota + }; + +@@ -694,6 +695,7 @@ + {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_extents, "extents"}, 
+ {Opt_extdebug, "extdebug"}, ++ {Opt_mballoc, "mballoc"}, + {Opt_barrier, "barrier=%u"}, + {Opt_err, NULL}, + {Opt_resize, "resize"}, +@@ -1041,6 +1043,9 @@ + case Opt_extdebug: + set_opt (sbi->s_mount_opt, EXTDEBUG); + break; ++ case Opt_mballoc: ++ set_opt (sbi->s_mount_opt, MBALLOC); ++ break; + default: + printk (KERN_ERR + "EXT3-fs: Unrecognized mount option \"%s\" " +@@ -1766,6 +1771,7 @@ + ext3_count_dirs(sb)); + + ext3_ext_init(sb); ++ ext3_mb_init(sb, needs_recovery); + lock_kernel(); + return 0; + +@@ -2699,7 +2705,13 @@ + + static int __init init_ext3_fs(void) + { +- int err = init_ext3_xattr(); ++ int err; ++ ++ err = init_ext3_proc(); ++ if (err) ++ return err; ++ ++ err = init_ext3_xattr(); + if (err) + return err; + err = init_inodecache(); +@@ -2721,6 +2733,7 @@ + unregister_filesystem(&ext3_fs_type); + destroy_inodecache(); + exit_ext3_xattr(); ++ exit_ext3_proc(); + } + + int ext3_prep_san_write(struct inode *inode, long *blocks, +Index: linux-2.6.16.i686/fs/ext3/Makefile +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/Makefile 2006-05-30 22:55:32.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/Makefile 2006-05-30 23:02:59.000000000 +0800 +@@ -6,7 +6,7 @@ + + ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ + ioctl.o namei.o super.o symlink.o hash.o resize.o \ +- extents.o ++ extents.o mballoc.o + + ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o + ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o +Index: linux-2.6.16.i686/include/linux/ext3_fs.h +=================================================================== +--- linux-2.6.16.i686.orig/include/linux/ext3_fs.h 2006-05-30 22:55:32.000000000 +0800 ++++ linux-2.6.16.i686/include/linux/ext3_fs.h 2006-05-30 23:02:59.000000000 +0800 +@@ -57,6 +57,14 @@ + #define ext3_debug(f, a...) 
do {} while (0) + #endif + ++#define EXT3_MULTIBLOCK_ALLOCATOR 1 ++ ++#define EXT3_MB_HINT_MERGE 1 ++#define EXT3_MB_HINT_RESERVED 2 ++#define EXT3_MB_HINT_METADATA 4 ++#define EXT3_MB_HINT_FIRST 8 ++#define EXT3_MB_HINT_BEST 16 ++ + /* + * Special inodes numbers + */ +@@ -383,6 +391,7 @@ + #define EXT3_MOUNT_IOPEN_NOPRIV 0x800000/* Make iopen world-readable */ + #define EXT3_MOUNT_EXTENTS 0x1000000/* Extents support */ + #define EXT3_MOUNT_EXTDEBUG 0x2000000/* Extents debug */ ++#define EXT3_MOUNT_MBALLOC 0x800000/* Buddy allocation support */ + + /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ + #ifndef clear_opt +@@ -744,7 +753,7 @@ + extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group); + extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *); + extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long, +- unsigned long); ++ unsigned long, int); + extern void ext3_free_blocks_sb (handle_t *, struct super_block *, + unsigned long, unsigned long, int *); + extern unsigned long ext3_count_free_blocks (struct super_block *); +@@ -865,6 +874,17 @@ + extern int ext3_ext_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg); + ++/* mballoc.c */ ++extern long ext3_mb_stats; ++extern long ext3_mb_max_to_scan; ++extern int ext3_mb_init(struct super_block *, int); ++extern int ext3_mb_release(struct super_block *); ++extern int ext3_mb_new_blocks(handle_t *, struct inode *, unsigned long, int *, int, int *); ++extern int ext3_mb_reserve_blocks(struct super_block *, int); ++extern void ext3_mb_release_blocks(struct super_block *, int); ++int __init init_ext3_proc(void); ++void exit_ext3_proc(void); ++ + #endif /* __KERNEL__ */ + + /* EXT3_IOC_CREATE_INUM at bottom of file (visible to kernel and user). 
*/ +Index: linux-2.6.16.i686/include/linux/ext3_fs_sb.h +=================================================================== +--- linux-2.6.16.i686.orig/include/linux/ext3_fs_sb.h 2006-03-20 13:53:29.000000000 +0800 ++++ linux-2.6.16.i686/include/linux/ext3_fs_sb.h 2006-05-30 23:02:59.000000000 +0800 +@@ -21,8 +21,14 @@ + #include + #include + #include ++#include + #endif + #include ++#include ++ ++struct ext3_buddy_group_blocks; ++struct ext3_mb_history; ++#define EXT3_BB_MAX_BLOCKS + + /* + * third extended-fs super-block data in memory +@@ -78,6 +84,38 @@ + char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ + int s_jquota_fmt; /* Format of quota to use */ + #endif ++ ++ /* for buddy allocator */ ++ struct ext3_group_info **s_group_info; ++ struct inode *s_buddy_cache; ++ long s_blocks_reserved; ++ spinlock_t s_reserve_lock; ++ struct list_head s_active_transaction; ++ struct list_head s_closed_transaction; ++ struct list_head s_committed_transaction; ++ spinlock_t s_md_lock; ++ tid_t s_last_transaction; ++ int s_mb_factor; ++ unsigned short *s_mb_offsets, *s_mb_maxs; ++ ++ /* history to debug policy */ ++ struct ext3_mb_history *s_mb_history; ++ int s_mb_history_cur; ++ int s_mb_history_max; ++ struct proc_dir_entry *s_mb_proc; ++ spinlock_t s_mb_history_lock; ++ ++ /* stats for buddy allocator */ ++ atomic_t s_bal_reqs; /* number of reqs with len > 1 */ ++ atomic_t s_bal_success; /* we found long enough chunks */ ++ atomic_t s_bal_allocated; /* in blocks */ ++ atomic_t s_bal_ex_scanned; /* total extents scanned */ ++ atomic_t s_bal_goals; /* goal hits */ ++ atomic_t s_bal_breaks; /* too long searches */ ++ atomic_t s_bal_2orders; /* 2^order hits */ ++ spinlock_t s_bal_lock; ++ unsigned long s_mb_buddies_generated; ++ unsigned long long s_mb_generation_time; + }; + + #endif /* _LINUX_EXT3_FS_SB */ diff --git a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-suse.patch b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-suse.patch index 
2a64875..c77ebdd 100644 --- a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-suse.patch +++ b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6-suse.patch @@ -1,7 +1,7 @@ -Index: linux-2.6.5-7.201/include/linux/ext3_fs.h +Index: linux-2.6.5-7.252-full/include/linux/ext3_fs.h =================================================================== ---- linux-2.6.5-7.201.orig/include/linux/ext3_fs.h 2005-12-17 02:53:30.000000000 +0300 -+++ linux-2.6.5-7.201/include/linux/ext3_fs.h 2005-12-17 03:13:38.000000000 +0300 +--- linux-2.6.5-7.252-full.orig/include/linux/ext3_fs.h 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/include/linux/ext3_fs.h 2006-04-26 23:40:28.000000000 +0400 @@ -57,6 +57,14 @@ struct statfs; #define ext3_debug(f, a...) do {} while (0) #endif @@ -31,8 +31,8 @@ Index: linux-2.6.5-7.201/include/linux/ext3_fs.h extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long, - unsigned long); + unsigned long, int); -+extern void ext3_free_blocks_old (handle_t *, struct inode *, unsigned long, -+ unsigned long); ++extern void ext3_free_blocks_old(handle_t *, struct inode *, unsigned long, ++ unsigned long); extern unsigned long ext3_count_free_blocks (struct super_block *); extern void ext3_check_blocks_bitmap (struct super_block *); extern struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb, @@ -54,10 +54,10 @@ Index: linux-2.6.5-7.201/include/linux/ext3_fs.h #endif /* __KERNEL__ */ #define EXT3_IOC_CREATE_INUM _IOW('f', 5, long) -Index: linux-2.6.5-7.201/include/linux/ext3_fs_sb.h +Index: linux-2.6.5-7.252-full/include/linux/ext3_fs_sb.h =================================================================== ---- linux-2.6.5-7.201.orig/include/linux/ext3_fs_sb.h 2005-12-17 02:53:25.000000000 +0300 -+++ linux-2.6.5-7.201/include/linux/ext3_fs_sb.h 2005-12-17 03:10:23.000000000 +0300 +--- linux-2.6.5-7.252-full.orig/include/linux/ext3_fs_sb.h 2006-04-25 17:42:19.000000000 +0400 ++++ 
linux-2.6.5-7.252-full/include/linux/ext3_fs_sb.h 2006-04-26 23:40:28.000000000 +0400 @@ -23,9 +23,15 @@ #define EXT_INCLUDE #include @@ -74,13 +74,13 @@ Index: linux-2.6.5-7.201/include/linux/ext3_fs_sb.h /* * third extended-fs super-block data in memory -@@ -78,6 +84,38 @@ struct ext3_sb_info { +@@ -78,6 +84,43 @@ struct ext3_sb_info { struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */ wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */ #endif + + /* for buddy allocator */ -+ struct ext3_group_info **s_group_info; ++ struct ext3_group_info ***s_group_info; + struct inode *s_buddy_cache; + long s_blocks_reserved; + spinlock_t s_reserve_lock; @@ -91,6 +91,7 @@ Index: linux-2.6.5-7.201/include/linux/ext3_fs_sb.h + tid_t s_last_transaction; + int s_mb_factor; + unsigned short *s_mb_offsets, *s_mb_maxs; ++ unsigned long s_stripe; + + /* history to debug policy */ + struct ext3_mb_history *s_mb_history; @@ -111,12 +112,16 @@ Index: linux-2.6.5-7.201/include/linux/ext3_fs_sb.h + unsigned long s_mb_buddies_generated; + unsigned long long s_mb_generation_time; }; ++ ++#define EXT3_GROUP_INFO(sb, group) \ ++ EXT3_SB(sb)->s_group_info[(group) >> EXT3_DESC_PER_BLOCK_BITS(sb)] \ ++ [(group) & (EXT3_DESC_PER_BLOCK(sb) - 1)] #endif /* _LINUX_EXT3_FS_SB */ -Index: linux-2.6.5-7.201/fs/ext3/super.c +Index: linux-2.6.5-7.252-full/fs/ext3/super.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/super.c 2005-12-17 02:53:30.000000000 +0300 -+++ linux-2.6.5-7.201/fs/ext3/super.c 2005-12-17 03:10:23.000000000 +0300 +--- linux-2.6.5-7.252-full.orig/fs/ext3/super.c 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/super.c 2006-04-26 23:40:28.000000000 +0400 @@ -389,6 +389,7 @@ void ext3_put_super (struct super_block struct ext3_super_block *es = sbi->s_es; int i; @@ -125,34 +130,45 @@ Index: linux-2.6.5-7.201/fs/ext3/super.c ext3_ext_release(sb); 
ext3_xattr_put_super(sb); journal_destroy(sbi->s_journal); -@@ -543,7 +544,7 @@ enum { - Opt_ignore, Opt_barrier, +@@ -545,6 +546,7 @@ enum { Opt_err, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -- Opt_extents, Opt_extdebug, -+ Opt_extents, Opt_extdebug, Opt_mballoc, + Opt_extents, Opt_noextents, Opt_extdebug, ++ Opt_mballoc, Opt_nomballoc, Opt_stripe, }; static match_table_t tokens = { -@@ -590,6 +591,7 @@ static match_table_t tokens = { - {Opt_iopen_nopriv, "iopen_nopriv"}, +@@ -591,6 +592,9 @@ static match_table_t tokens = { {Opt_extents, "extents"}, + {Opt_noextents, "noextents"}, {Opt_extdebug, "extdebug"}, + {Opt_mballoc, "mballoc"}, ++ {Opt_nomballoc, "nomballoc"}, ++ {Opt_stripe, "stripe=%u"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL} }; -@@ -811,6 +813,9 @@ static int parse_options (char * options +@@ -813,6 +815,19 @@ static int parse_options (char * options case Opt_extdebug: set_opt (sbi->s_mount_opt, EXTDEBUG); break; + case Opt_mballoc: -+ set_opt (sbi->s_mount_opt, MBALLOC); ++ set_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_nomballoc: ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_stripe: ++ if (match_int(&args[0], &option)) ++ return 0; ++ if (option < 0) ++ return 0; ++ sbi->s_stripe = option; + break; default: printk (KERN_ERR "EXT3-fs: Unrecognized mount option \"%s\" " -@@ -1464,6 +1469,7 @@ static int ext3_fill_super (struct super +@@ -1466,6 +1471,7 @@ static int ext3_fill_super (struct super ext3_count_dirs(sb)); ext3_ext_init(sb); @@ -160,7 +176,7 @@ Index: linux-2.6.5-7.201/fs/ext3/super.c return 0; -@@ -2112,7 +2118,13 @@ static struct file_system_type ext3_fs_t +@@ -2114,7 +2120,13 @@ static struct file_system_type ext3_fs_t static int __init init_ext3_fs(void) { @@ -175,7 +191,7 @@ Index: linux-2.6.5-7.201/fs/ext3/super.c if (err) return err; err = init_inodecache(); -@@ -2141,6 +2153,7 @@ static void __exit exit_ext3_fs(void) +@@ -2143,6 +2155,7 @@ static void __exit exit_ext3_fs(void) 
unregister_filesystem(&ext3_fs_type); destroy_inodecache(); exit_ext3_xattr(); @@ -183,11 +199,11 @@ Index: linux-2.6.5-7.201/fs/ext3/super.c } int ext3_prep_san_write(struct inode *inode, long *blocks, -Index: linux-2.6.5-7.201/fs/ext3/extents.c +Index: linux-2.6.5-7.252-full/fs/ext3/extents.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/extents.c 2005-12-17 02:53:29.000000000 +0300 -+++ linux-2.6.5-7.201/fs/ext3/extents.c 2005-12-17 03:10:23.000000000 +0300 -@@ -771,7 +771,7 @@ cleanup: +--- linux-2.6.5-7.252-full.orig/fs/ext3/extents.c 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/extents.c 2006-04-26 23:40:28.000000000 +0400 +@@ -777,7 +777,7 @@ cleanup: for (i = 0; i < depth; i++) { if (!ablocks[i]) continue; @@ -196,7 +212,7 @@ Index: linux-2.6.5-7.201/fs/ext3/extents.c } } kfree(ablocks); -@@ -1428,7 +1428,7 @@ int ext3_ext_rm_idx(handle_t *handle, st +@@ -1434,7 +1434,7 @@ int ext3_ext_rm_idx(handle_t *handle, st path->p_idx->ei_leaf); bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); @@ -205,7 +221,7 @@ Index: linux-2.6.5-7.201/fs/ext3/extents.c return err; } -@@ -1913,10 +1913,12 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1919,10 +1919,12 @@ ext3_remove_blocks(struct ext3_extents_t int needed = ext3_remove_blocks_credits(tree, ex, from, to); handle_t *handle = ext3_journal_start(tree->inode, needed); struct buffer_head *bh; @@ -219,7 +235,7 @@ Index: linux-2.6.5-7.201/fs/ext3/extents.c if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { /* tail removal */ unsigned long num, start; -@@ -1928,7 +1930,7 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1934,7 +1936,7 @@ ext3_remove_blocks(struct ext3_extents_t bh = sb_find_get_block(tree->inode->i_sb, start + i); ext3_forget(handle, 0, tree->inode, bh, start + i); } @@ -228,11 +244,11 @@ Index: 
linux-2.6.5-7.201/fs/ext3/extents.c } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", from, to, ex->ee_block, ex->ee_len); -Index: linux-2.6.5-7.201/fs/ext3/inode.c +Index: linux-2.6.5-7.252-full/fs/ext3/inode.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/inode.c 2005-12-17 02:53:30.000000000 +0300 -+++ linux-2.6.5-7.201/fs/ext3/inode.c 2005-12-17 03:10:23.000000000 +0300 -@@ -572,7 +572,7 @@ static int ext3_alloc_branch(handle_t *h +--- linux-2.6.5-7.252-full.orig/fs/ext3/inode.c 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/inode.c 2006-04-26 23:40:28.000000000 +0400 +@@ -574,7 +574,7 @@ static int ext3_alloc_branch(handle_t *h ext3_journal_forget(handle, branch[i].bh); } for (i = 0; i < keys; i++) @@ -241,7 +257,7 @@ Index: linux-2.6.5-7.201/fs/ext3/inode.c return err; } -@@ -673,7 +673,7 @@ err_out: +@@ -675,7 +675,7 @@ err_out: if (err == -EAGAIN) for (i = 0; i < num; i++) ext3_free_blocks(handle, inode, @@ -250,7 +266,7 @@ Index: linux-2.6.5-7.201/fs/ext3/inode.c return err; } -@@ -1835,7 +1835,7 @@ ext3_clear_blocks(handle_t *handle, stru +@@ -1837,7 +1837,7 @@ ext3_clear_blocks(handle_t *handle, stru } } @@ -259,7 +275,7 @@ Index: linux-2.6.5-7.201/fs/ext3/inode.c } /** -@@ -2006,7 +2006,7 @@ static void ext3_free_branches(handle_t +@@ -2008,7 +2008,7 @@ static void ext3_free_branches(handle_t ext3_journal_test_restart(handle, inode); } @@ -268,10 +284,10 @@ Index: linux-2.6.5-7.201/fs/ext3/inode.c if (parent_bh) { /* -Index: linux-2.6.5-7.201/fs/ext3/balloc.c +Index: linux-2.6.5-7.252-full/fs/ext3/balloc.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/balloc.c 2005-10-11 00:12:45.000000000 +0400 -+++ linux-2.6.5-7.201/fs/ext3/balloc.c 2005-12-17 03:10:23.000000000 +0300 +--- linux-2.6.5-7.252-full.orig/fs/ext3/balloc.c 2006-02-14 
15:26:58.000000000 +0300 ++++ linux-2.6.5-7.252-full/fs/ext3/balloc.c 2006-04-26 23:40:28.000000000 +0400 @@ -78,7 +78,7 @@ struct ext3_group_desc * ext3_get_group_ * * Return buffer_head on success or NULL in case of failure. @@ -299,10 +315,10 @@ Index: linux-2.6.5-7.201/fs/ext3/balloc.c unsigned long goal, int *errp) { struct buffer_head *bitmap_bh = NULL; -Index: linux-2.6.5-7.201/fs/ext3/xattr.c +Index: linux-2.6.5-7.252-full/fs/ext3/xattr.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/xattr.c 2005-12-17 02:53:26.000000000 +0300 -+++ linux-2.6.5-7.201/fs/ext3/xattr.c 2005-12-17 03:10:41.000000000 +0300 +--- linux-2.6.5-7.252-full.orig/fs/ext3/xattr.c 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/xattr.c 2006-04-26 23:40:28.000000000 +0400 @@ -1371,7 +1371,7 @@ ext3_xattr_set_handle2(handle_t *handle, new_bh = sb_getblk(sb, block); if (!new_bh) { @@ -330,11 +346,11 @@ Index: linux-2.6.5-7.201/fs/ext3/xattr.c get_bh(bh); ext3_forget(handle, 1, inode, bh, EXT3_I(inode)->i_file_acl); } else { -Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +Index: linux-2.6.5-7.252-full/fs/ext3/mballoc.c =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/mballoc.c 2005-12-09 13:08:53.191437750 +0300 -+++ linux-2.6.5-7.201/fs/ext3/mballoc.c 2005-12-17 03:15:04.000000000 +0300 -@@ -0,0 +1,2430 @@ +--- linux-2.6.5-7.252-full.orig/fs/ext3/mballoc.c 2006-04-22 17:31:47.543334750 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/mballoc.c 2006-04-26 23:42:45.000000000 +0400 +@@ -0,0 +1,2702 @@ +/* + * Copyright (c) 2003-2005, Cluster File Systems, Inc, info@clusterfs.com + * Written by Alex Tomas @@ -423,6 +439,12 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + +long ext3_mb_stats = 1; + ++/* ++ * for which requests use 2^N search using buddies ++ */ ++long ext3_mb_order2_reqs = 8; ++ ++ +#ifdef EXT3_BB_MAX_BLOCKS +#undef EXT3_BB_MAX_BLOCKS +#endif @@ -463,10 
+485,10 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + + /* search goals */ + struct ext3_free_extent ac_g_ex; -+ ++ + /* the best found extent */ + struct ext3_free_extent ac_b_ex; -+ ++ + /* number of iterations done. we have to track to limit searching */ + unsigned long ac_ex_scanned; + __u16 ac_groups_scanned; @@ -488,6 +510,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +struct ext3_mb_history { + struct ext3_free_extent goal; /* goal allocation */ + struct ext3_free_extent result; /* result allocation */ ++ unsigned pid; ++ unsigned ino; + __u16 found; /* how many extents have been found */ + __u16 groups; /* how many groups have been scanned */ + __u16 tail; /* what tail broke some buddy */ @@ -510,9 +534,9 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +#define EXT3_MB_BUDDY(e3b) ((e3b)->bd_buddy) + +#ifndef EXT3_MB_HISTORY -+#define ext3_mb_store_history(sb,ac) ++#define ext3_mb_store_history(sb,ino,ac) +#else -+static void ext3_mb_store_history(struct super_block *, ++static void ext3_mb_store_history(struct super_block *, unsigned ino, + struct ext3_allocation_context *ac); +#endif + @@ -631,7 +655,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + if (mb_check_counter++ % 300 != 0) + return; + } -+ ++ + while (order > 1) { + buddy = mb_find_buddy(e3b, order, &max); + J_ASSERT(buddy); @@ -812,7 +836,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + sb = inode->i_sb; + blocksize = 1 << inode->i_blkbits; + blocks_per_page = PAGE_CACHE_SIZE / blocksize; -+ ++ + groups_per_page = blocks_per_page >> 1; + if (groups_per_page == 0) + groups_per_page = 1; @@ -827,9 +851,9 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + memset(bh, 0, i); + } else + bh = &bhs; -+ ++ + first_group = page->index * blocks_per_page / 2; -+ ++ + /* read all groups the page covers into the cache */ + for (i = 0; i < groups_per_page; i++) { + struct ext3_group_desc * desc; @@ -884,11 +908,11 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + mb_debug("put buddy for group %u in page 
%lu/%x\n", + group, page->index, i * blocksize); + memset(data, 0xff, blocksize); -+ EXT3_SB(sb)->s_group_info[group]->bb_fragments = 0; -+ memset(EXT3_SB(sb)->s_group_info[group]->bb_counters, 0, ++ EXT3_GROUP_INFO(sb, group)->bb_fragments = 0; ++ memset(EXT3_GROUP_INFO(sb, group)->bb_counters, 0, + sizeof(unsigned short)*(sb->s_blocksize_bits+2)); + ext3_mb_generate_buddy(sb, data, bitmap, -+ EXT3_SB(sb)->s_group_info[group]); ++ EXT3_GROUP_INFO(sb, group)); + } else { + /* this is block of bitmap */ + mb_debug("put bitmap for group %u in page %lu/%x\n", @@ -921,7 +945,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; + + e3b->bd_blkbits = sb->s_blocksize_bits; -+ e3b->bd_info = sbi->s_group_info[group]; ++ e3b->bd_info = EXT3_GROUP_INFO(sb, group); + e3b->bd_sb = sb; + e3b->bd_group = group; + e3b->bd_buddy_page = NULL; @@ -997,14 +1021,14 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +ext3_lock_group(struct super_block *sb, int group) +{ + bit_spin_lock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static inline void +ext3_unlock_group(struct super_block *sb, int group) +{ + bit_spin_unlock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static int mb_find_order_for_block(struct ext3_buddy *e3b, int block) @@ -1134,7 +1158,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +static int mb_find_extent(struct ext3_buddy *e3b, int order, int block, + int needed, struct ext3_free_extent *ex) +{ -+ int next, max, ord; ++ int next = block, max, ord; + void *buddy; + + J_ASSERT(ex != NULL); @@ -1159,6 +1183,11 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + ex->fe_start = block << order; + ex->fe_group = e3b->bd_group; + ++ /* calc difference from given start */ ++ next = next - ex->fe_start; ++ ex->fe_len -= next; ++ ex->fe_start += next; ++ + while (needed > 
ex->fe_len && (buddy = mb_find_buddy(e3b, order, &max))) { + + if (block + 1 >= max) @@ -1354,7 +1383,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ex.fe_start, ex.fe_len, &ex); -+ ++ + if (max > 0) { + ac->ac_b_ex = ex; + ext3_mb_use_best_found(ac, e3b); @@ -1371,6 +1400,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + struct ext3_buddy *e3b) +{ + int group = ac->ac_g_ex.fe_group, max, err; ++ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); ++ struct ext3_super_block *es = sbi->s_es; + struct ext3_free_extent ex; + + err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); @@ -1379,9 +1410,27 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ac->ac_g_ex.fe_start, -+ ac->ac_g_ex.fe_len, &ex); -+ -+ if (max > 0) { ++ ac->ac_g_ex.fe_len, &ex); ++ ++ if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) { ++ unsigned long start; ++ start = (e3b->bd_group * EXT3_BLOCKS_PER_GROUP(ac->ac_sb) + ++ ex.fe_start + le32_to_cpu(es->s_first_data_block)); ++ if (start % sbi->s_stripe == 0) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ } else if (max >= ac->ac_g_ex.fe_len) { ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); ++ J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } else if (max > 0 && (ac->ac_flags & EXT3_MB_HINT_MERGE)) { ++ /* Sometimes, caller may want to merge even small ++ * number of blocks to an existing extent */ + J_ASSERT(ex.fe_len > 0); + J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); + J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); @@ -1409,7 +1458,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + int i, k, max; + + J_ASSERT(ac->ac_2order > 0); -+ for (i = ac->ac_2order; i < sb->s_blocksize_bits + 1; i++) { ++ for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) { 
+ if (grp->bb_counters[i] == 0) + continue; + @@ -1474,11 +1523,46 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + } +} + ++/* ++ * This is a special case for storages like raid5 ++ * we try to find stripe-aligned chunks for stripe-size requests ++ */ ++static void ext3_mb_scan_aligned(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ void *bitmap = EXT3_MB_BITMAP(e3b); ++ struct ext3_free_extent ex; ++ unsigned long i, max; ++ ++ J_ASSERT(sbi->s_stripe != 0); ++ ++ /* find first stripe-aligned block */ ++ i = e3b->bd_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + le32_to_cpu(sbi->s_es->s_first_data_block); ++ i = ((i + sbi->s_stripe - 1) / sbi->s_stripe) * sbi->s_stripe; ++ i = (i - le32_to_cpu(sbi->s_es->s_first_data_block)) ++ % EXT3_BLOCKS_PER_GROUP(sb); ++ ++ while (i < sb->s_blocksize * 8) { ++ if (!mb_test_bit(i, bitmap)) { ++ max = mb_find_extent(e3b, 0, i, sbi->s_stripe, &ex); ++ if (max >= sbi->s_stripe) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ break; ++ } ++ } ++ i += sbi->s_stripe; ++ } ++} ++ +static int ext3_mb_good_group(struct ext3_allocation_context *ac, + int group, int cr) +{ -+ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); -+ struct ext3_group_info *grp = sbi->s_group_info[group]; ++ struct ext3_group_info *grp = EXT3_GROUP_INFO(ac->ac_sb, group); + unsigned free, fragments, i, bits; + + J_ASSERT(cr >= 0 && cr < 4); @@ -1495,15 +1579,18 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + case 0: + J_ASSERT(ac->ac_2order != 0); + bits = ac->ac_sb->s_blocksize_bits + 1; -+ for (i = ac->ac_2order; i < bits; i++) ++ for (i = ac->ac_2order; i <= bits; i++) + if (grp->bb_counters[i] > 0) + return 1; ++ break; + case 1: + if ((free / fragments) >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 2: + if (free >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 3: + return 1; + default: @@ -1604,23 +1691,27 @@ Index: 
linux-2.6.5-7.201/fs/ext3/mballoc.c + ac.ac_2order = 0; + ac.ac_criteria = 0; + ++ if (*len == 1 && sbi->s_stripe) { ++ /* looks like a metadata, let's use a dirty hack for raid5 ++ * move all metadata in first groups in hope to hit cached ++ * sectors and thus avoid read-modify cycles in raid5 */ ++ ac.ac_g_ex.fe_group = group = 0; ++ } ++ + /* probably, the request is for 2^8+ blocks (1/2/3/... MB) */ + i = ffs(*len); -+ if (i >= 8) { ++ if (i >= ext3_mb_order2_reqs) { + i--; + if ((*len & (~(1 << i))) == 0) + ac.ac_2order = i; + } + -+ /* Sometimes, caller may want to merge even small -+ * number of blocks to an existing extent */ -+ if (ac.ac_flags & EXT3_MB_HINT_MERGE) { -+ err = ext3_mb_find_by_goal(&ac, &e3b); -+ if (err) -+ goto out_err; -+ if (ac.ac_status == AC_STATUS_FOUND) -+ goto found; -+ } ++ /* first, try the goal */ ++ err = ext3_mb_find_by_goal(&ac, &e3b); ++ if (err) ++ goto out_err; ++ if (ac.ac_status == AC_STATUS_FOUND) ++ goto found; + + /* Let's just scan groups to find more-less suitable blocks */ + cr = ac.ac_2order ? 0 : 1; @@ -1631,7 +1722,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + if (group == EXT3_SB(sb)->s_groups_count) + group = 0; + -+ if (EXT3_MB_GRP_NEED_INIT(sbi->s_group_info[group])) { ++ if (EXT3_MB_GRP_NEED_INIT(EXT3_GROUP_INFO(sb, group))) { + /* we need full data about the group + * to make a good selection */ + err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); @@ -1659,6 +1750,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + ac.ac_groups_scanned++; + if (cr == 0) + ext3_mb_simple_scan_group(&ac, &e3b); ++ else if (cr == 1 && *len == sbi->s_stripe) ++ ext3_mb_scan_aligned(&ac, &e3b); + else + ext3_mb_complex_scan_group(&ac, &e3b); + @@ -1672,7 +1765,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + } + + if (ac.ac_b_ex.fe_len > 0 && ac.ac_status != AC_STATUS_FOUND && -+ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { ++ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { + /* + * We've been searching too long. 
Let's try to allocate + * the best chunk we've found so far @@ -1717,8 +1810,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + sbi->s_blocks_reserved, ac.ac_found); + printk("EXT3-fs: groups: "); + for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) -+ printk("%d: %d ", i, -+ sbi->s_group_info[i]->bb_free); ++ printk("%d: %d ", i, EXT3_GROUP_INFO(sb, i)->bb_free); + printk("\n"); +#endif + goto out; @@ -1756,7 +1848,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + *errp = -EIO; + goto out_err; + } -+ ++ + err = ext3_journal_get_write_access(handle, gdp_bh); + if (err) + goto out_err; @@ -1825,7 +1917,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + * path only, here is single block always */ + ext3_mb_release_blocks(sb, 1); + } -+ ++ + if (unlikely(ext3_mb_stats) && ac.ac_g_ex.fe_len > 1) { + atomic_inc(&sbi->s_bal_reqs); + atomic_add(*len, &sbi->s_bal_allocated); @@ -1839,7 +1931,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + atomic_inc(&sbi->s_bal_breaks); + } + -+ ext3_mb_store_history(sb, &ac); ++ ext3_mb_store_history(sb, inode->i_ino, &ac); + + return block; +} @@ -1904,9 +1996,9 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + char buf[20], buf2[20]; + + if (v == SEQ_START_TOKEN) { -+ seq_printf(seq, "%-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", -+ "goal", "result", "found", "grps", "cr", "merge", -+ "tail", "broken"); ++ seq_printf(seq, "%-5s %-8s %-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", ++ "pid", "inode", "goal", "result", "found", "grps", "cr", ++ "merge", "tail", "broken"); + return 0; + } + @@ -1914,9 +2006,9 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + hs->goal.fe_start, hs->goal.fe_len); + sprintf(buf2, "%u/%u/%u", hs->result.fe_group, + hs->result.fe_start, hs->result.fe_len); -+ seq_printf(seq, "%-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", buf, -+ buf2, hs->found, hs->groups, hs->cr, -+ hs->merged ? 
"M" : "", hs->tail, ++ seq_printf(seq, "%-5u %-8u %-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", ++ hs->pid, hs->ino, buf, buf2, hs->found, hs->groups, ++ hs->cr, hs->merged ? "M" : "", hs->tail, + hs->buddy ? 1 << hs->buddy : 0); + return 0; +} @@ -1950,7 +2042,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + s->max = sbi->s_mb_history_max; + s->start = sbi->s_mb_history_cur % s->max; + spin_unlock(&sbi->s_mb_history_lock); -+ ++ + rc = seq_open(file, &ext3_mb_seq_history_ops); + if (rc == 0) { + struct seq_file *m = (struct seq_file *)file->private_data; @@ -1974,10 +2066,104 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + +static struct file_operations ext3_mb_seq_history_fops = { + .owner = THIS_MODULE, -+ .open = ext3_mb_seq_history_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = ext3_mb_seq_history_release, ++ .open = ext3_mb_seq_history_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = ext3_mb_seq_history_release, ++}; ++ ++static void *ext3_mb_seq_groups_start(struct seq_file *seq, loff_t *pos) ++{ ++ struct super_block *sb = seq->private; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ long group; ++ ++ if (*pos < 0 || *pos >= sbi->s_groups_count) ++ return NULL; ++ ++ group = *pos + 1; ++ return (void *) group; ++} ++ ++static void *ext3_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos) ++{ ++ struct super_block *sb = seq->private; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ long group; ++ ++ ++*pos; ++ if (*pos < 0 || *pos >= sbi->s_groups_count) ++ return NULL; ++ group = *pos + 1; ++ return (void *) group;; ++} ++ ++static int ext3_mb_seq_groups_show(struct seq_file *seq, void *v) ++{ ++ struct super_block *sb = seq->private; ++ long group = (long) v, i; ++ struct sg { ++ struct ext3_group_info info; ++ unsigned short counters[16]; ++ } sg; ++ ++ group--; ++ if (group == 0) ++ seq_printf(seq, "#%-5s: %-5s %-5s %-5s [ %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n", ++ "group", 
"free", "frags", "first", "2^0", "2^1", "2^2", ++ "2^3", "2^4", "2^5", "2^6", "2^7", "2^8", "2^9", "2^10", ++ "2^11", "2^12", "2^13"); ++ ++ i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) + ++ sizeof(struct ext3_group_info); ++ ext3_lock_group(sb, group); ++ memcpy(&sg, EXT3_GROUP_INFO(sb, group), i); ++ ext3_unlock_group(sb, group); ++ ++ if (EXT3_MB_GRP_NEED_INIT(&sg.info)) ++ return 0; ++ ++ seq_printf(seq, "#%-5lu: %-5u %-5u %-5u [", group, sg.info.bb_free, ++ sg.info.bb_fragments, sg.info.bb_first_free); ++ for (i = 0; i <= 13; i++) ++ seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ? ++ sg.info.bb_counters[i] : 0); ++ seq_printf(seq, " ]\n"); ++ ++ return 0; ++} ++ ++static void ext3_mb_seq_groups_stop(struct seq_file *seq, void *v) ++{ ++} ++ ++static struct seq_operations ext3_mb_seq_groups_ops = { ++ .start = ext3_mb_seq_groups_start, ++ .next = ext3_mb_seq_groups_next, ++ .stop = ext3_mb_seq_groups_stop, ++ .show = ext3_mb_seq_groups_show, ++}; ++ ++static int ext3_mb_seq_groups_open(struct inode *inode, struct file *file) ++{ ++ struct super_block *sb = PDE(inode)->data; ++ int rc; ++ ++ rc = seq_open(file, &ext3_mb_seq_groups_ops); ++ if (rc == 0) { ++ struct seq_file *m = (struct seq_file *)file->private_data; ++ m->private = sb; ++ } ++ return rc; ++ ++} ++ ++static struct file_operations ext3_mb_seq_groups_fops = { ++ .owner = THIS_MODULE, ++ .open = ext3_mb_seq_groups_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = seq_release, +}; + +static void ext3_mb_history_release(struct super_block *sb) @@ -1986,6 +2172,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + char name[64]; + + snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ remove_proc_entry("mb_groups", sbi->s_mb_proc); + remove_proc_entry("mb_history", sbi->s_mb_proc); + remove_proc_entry(name, proc_root_ext3); + @@ -2008,6 +2195,11 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + p->proc_fops = &ext3_mb_seq_history_fops; + p->data = 
sb; + } ++ p = create_proc_entry("mb_groups", S_IRUGO, sbi->s_mb_proc); ++ if (p) { ++ p->proc_fops = &ext3_mb_seq_groups_fops; ++ p->data = sb; ++ } + } + + sbi->s_mb_history_max = 1000; @@ -2020,7 +2212,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +} + +static void -+ext3_mb_store_history(struct super_block *sb, struct ext3_allocation_context *ac) ++ext3_mb_store_history(struct super_block *sb, unsigned ino, ++ struct ext3_allocation_context *ac) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); + struct ext3_mb_history h; @@ -2028,6 +2221,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + if (likely(sbi->s_mb_history == NULL)) + return; + ++ h.pid = current->pid; ++ h.ino = ino; + h.goal = ac->ac_g_ex; + h.result = ac->ac_b_ex; + h.found = ac->ac_found; @@ -2055,21 +2250,40 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +int ext3_mb_init_backend(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i, len; -+ -+ len = sizeof(struct ext3_buddy_group_blocks *) * sbi->s_groups_count; -+ sbi->s_group_info = kmalloc(len, GFP_KERNEL); ++ int i, j, len, metalen; ++ int num_meta_group_infos = ++ (sbi->s_groups_count + EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ struct ext3_group_info **meta_group_info; ++ ++ /* An 8TB filesystem with 64-bit pointers requires a 4096 byte ++ * kmalloc. A 128kb malloc should suffice for a 256TB filesystem. ++ * So a two level scheme suffices for now. 
*/ ++ sbi->s_group_info = kmalloc(sizeof(*sbi->s_group_info) * ++ num_meta_group_infos, GFP_KERNEL); + if (sbi->s_group_info == NULL) { -+ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ printk(KERN_ERR "EXT3-fs: can't allocate buddy meta group\n"); + return -ENOMEM; + } -+ memset(sbi->s_group_info, 0, len); -+ + sbi->s_buddy_cache = new_inode(sb); + if (sbi->s_buddy_cache == NULL) { + printk(KERN_ERR "EXT3-fs: can't get new inode\n"); -+ kfree(sbi->s_group_info); -+ return -ENOMEM; ++ goto err_freesgi; ++ } ++ ++ metalen = sizeof(*meta_group_info) << EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) { ++ if ((i + 1) == num_meta_group_infos) ++ metalen = sizeof(*meta_group_info) * ++ (sbi->s_groups_count - ++ (i << EXT3_DESC_PER_BLOCK_BITS(sb))); ++ meta_group_info = kmalloc(metalen, GFP_KERNEL); ++ if (meta_group_info == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for a " ++ "buddy group\n"); ++ goto err_freemeta; ++ } ++ sbi->s_group_info[i] = meta_group_info; + } + + /* @@ -2081,30 +2295,42 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + for (i = 0; i < sbi->s_groups_count; i++) { + struct ext3_group_desc * desc; + -+ sbi->s_group_info[i] = kmalloc(len, GFP_KERNEL); -+ if (sbi->s_group_info[i] == NULL) { ++ meta_group_info = ++ sbi->s_group_info[i >> EXT3_DESC_PER_BLOCK_BITS(sb)]; ++ j = i & (EXT3_DESC_PER_BLOCK(sb) - 1); ++ ++ meta_group_info[j] = kmalloc(len, GFP_KERNEL); ++ if (meta_group_info[j] == NULL) { + printk(KERN_ERR "EXT3-fs: can't allocate buddy mem\n"); -+ goto err_out; ++ i--; ++ goto err_freebuddy; + } + desc = ext3_get_group_desc(sb, i, NULL); + if (desc == NULL) { + printk(KERN_ERR"EXT3-fs: can't read descriptor %u\n",i); -+ goto err_out; ++ goto err_freebuddy; + } -+ memset(sbi->s_group_info[i], 0, len); ++ memset(meta_group_info[j], 0, len); + set_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, -+ &sbi->s_group_info[i]->bb_state); -+ sbi->s_group_info[i]->bb_free = ++ &meta_group_info[j]->bb_state); ++ 
meta_group_info[j]->bb_free = + le16_to_cpu(desc->bg_free_blocks_count); + } + + return 0; + -+err_out: ++err_freebuddy: ++ while (i >= 0) { ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ i--; ++ } ++ i = num_meta_group_infos; ++err_freemeta: + while (--i >= 0) + kfree(sbi->s_group_info[i]); + iput(sbi->s_buddy_cache); -+ ++err_freesgi: ++ kfree(sbi->s_group_info); + return -ENOMEM; +} + @@ -2146,7 +2372,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + max = max >> 1; + i++; + } while (i <= sb->s_blocksize_bits + 1); -+ ++ + + /* init file for buddy data */ + if ((i = ext3_mb_init_backend(sb))) { @@ -2183,8 +2409,8 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c +int ext3_mb_release(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i; -+ ++ int i, num_meta_group_infos; ++ + if (!test_opt(sb, MBALLOC)) + return 0; + @@ -2198,11 +2424,13 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + ext3_mb_free_committed_blocks(sb); + + if (sbi->s_group_info) { -+ for (i = 0; i < sbi->s_groups_count; i++) { -+ if (sbi->s_group_info[i] == NULL) -+ continue; ++ for (i = 0; i < sbi->s_groups_count; i++) ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ num_meta_group_infos = (sbi->s_groups_count + ++ EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) + kfree(sbi->s_group_info[i]); -+ } + kfree(sbi->s_group_info); + } + if (sbi->s_mb_offsets) @@ -2496,7 +2724,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count); + spin_unlock(sb_bgl_lock(sbi, block_group)); + percpu_counter_mod(&sbi->s_freeblocks_counter, count); -+ ++ + ext3_mb_release_desc(&e3b); + + *freed = count; @@ -2580,10 +2808,11 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + return; +} + -+#define EXT3_ROOT "ext3" -+#define EXT3_MB_STATS_NAME "mb_stats" ++#define EXT3_ROOT "ext3" ++#define EXT3_MB_STATS_NAME "mb_stats" +#define EXT3_MB_MAX_TO_SCAN_NAME "mb_max_to_scan" +#define EXT3_MB_MIN_TO_SCAN_NAME 
"mb_min_to_scan" ++#define EXT3_MB_ORDER2_REQ "mb_order2_req" + +static int ext3_mb_stats_read(char *page, char **start, off_t off, + int count, int *eof, void *data) @@ -2671,6 +2900,45 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + return len; +} + ++static int ext3_mb_order2_req_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3-fs: %s string too long, max %u bytes\n", ++ EXT3_MB_MIN_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Only set to 0 or 1 respectively; zero->0; non-zero->1 */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_order2_reqs = value; ++ ++ return count; ++} ++ ++static int ext3_mb_order2_req_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_order2_reqs); ++ *start = page; ++ return len; ++} ++ +static int ext3_mb_min_to_scan_write(struct file *file, const char *buffer, + unsigned long count, void *data) +{ @@ -2701,6 +2969,7 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + struct proc_dir_entry *proc_ext3_mb_stats; + struct proc_dir_entry *proc_ext3_mb_max_to_scan; + struct proc_dir_entry *proc_ext3_mb_min_to_scan; ++ struct proc_dir_entry *proc_ext3_mb_order2_req; + + proc_root_ext3 = proc_mkdir(EXT3_ROOT, proc_root_fs); + if (proc_root_ext3 == NULL) { @@ -2755,6 +3024,24 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + proc_ext3_mb_min_to_scan->read_proc = ext3_mb_min_to_scan_read; + proc_ext3_mb_min_to_scan->write_proc = ext3_mb_min_to_scan_write; + ++ /* Initialize EXT3_ORDER2_REQ */ ++ proc_ext3_mb_order2_req = create_proc_entry( ++ EXT3_MB_ORDER2_REQ, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_order2_req == NULL) { ++ 
printk(KERN_ERR "EXT3-fs: Unable to create %s\n", ++ EXT3_MB_ORDER2_REQ); ++ remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_order2_req->data = NULL; ++ proc_ext3_mb_order2_req->read_proc = ext3_mb_order2_req_read; ++ proc_ext3_mb_order2_req->write_proc = ext3_mb_order2_req_write; ++ + return 0; +} + @@ -2763,13 +3050,14 @@ Index: linux-2.6.5-7.201/fs/ext3/mballoc.c + remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_ORDER2_REQ, proc_root_ext3); + remove_proc_entry(EXT3_ROOT, proc_root_fs); +} -Index: linux-2.6.5-7.201/fs/ext3/Makefile +Index: linux-2.6.5-7.252-full/fs/ext3/Makefile =================================================================== ---- linux-2.6.5-7.201.orig/fs/ext3/Makefile 2005-12-17 02:53:30.000000000 +0300 -+++ linux-2.6.5-7.201/fs/ext3/Makefile 2005-12-17 03:10:23.000000000 +0300 -@@ -6,7 +6,7 @@ +--- linux-2.6.5-7.252-full.orig/fs/ext3/Makefile 2006-04-25 17:42:19.000000000 +0400 ++++ linux-2.6.5-7.252-full/fs/ext3/Makefile 2006-04-26 23:40:28.000000000 +0400 +@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT3_FS) += ext3.o ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ ioctl.o namei.o super.o symlink.o hash.o \ diff --git a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.12.patch b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.12.patch index 70f4f8a..fae9e30 100644 --- a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.12.patch +++ b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.12.patch @@ -1,7 +1,7 @@ -Index: linux-2.6.12.6/include/linux/ext3_fs.h +Index: linux-2.6.12.6-bull/include/linux/ext3_fs.h 
=================================================================== ---- linux-2.6.12.6.orig/include/linux/ext3_fs.h 2005-12-17 02:17:16.000000000 +0300 -+++ linux-2.6.12.6/include/linux/ext3_fs.h 2005-12-17 02:21:21.000000000 +0300 +--- linux-2.6.12.6-bull.orig/include/linux/ext3_fs.h 2006-04-29 20:39:09.000000000 +0400 ++++ linux-2.6.12.6-bull/include/linux/ext3_fs.h 2006-04-29 20:39:10.000000000 +0400 @@ -57,6 +57,14 @@ struct statfs; #define ext3_debug(f, a...) do {} while (0) #endif @@ -52,10 +52,10 @@ Index: linux-2.6.12.6/include/linux/ext3_fs.h #endif /* __KERNEL__ */ /* EXT3_IOC_CREATE_INUM at bottom of file (visible to kernel and user). */ -Index: linux-2.6.12.6/include/linux/ext3_fs_sb.h +Index: linux-2.6.12.6-bull/include/linux/ext3_fs_sb.h =================================================================== ---- linux-2.6.12.6.orig/include/linux/ext3_fs_sb.h 2005-08-29 20:55:27.000000000 +0400 -+++ linux-2.6.12.6/include/linux/ext3_fs_sb.h 2005-12-17 02:21:21.000000000 +0300 +--- linux-2.6.12.6-bull.orig/include/linux/ext3_fs_sb.h 2005-08-29 20:55:27.000000000 +0400 ++++ linux-2.6.12.6-bull/include/linux/ext3_fs_sb.h 2006-04-29 20:39:10.000000000 +0400 @@ -21,8 +21,14 @@ #include #include @@ -71,13 +71,13 @@ Index: linux-2.6.12.6/include/linux/ext3_fs_sb.h /* * third extended-fs super-block data in memory -@@ -78,6 +84,38 @@ struct ext3_sb_info { +@@ -78,6 +84,43 @@ struct ext3_sb_info { char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + + /* for buddy allocator */ -+ struct ext3_group_info **s_group_info; ++ struct ext3_group_info ***s_group_info; + struct inode *s_buddy_cache; + long s_blocks_reserved; + spinlock_t s_reserve_lock; @@ -88,6 +88,7 @@ Index: linux-2.6.12.6/include/linux/ext3_fs_sb.h + tid_t s_last_transaction; + int s_mb_factor; + unsigned short *s_mb_offsets, *s_mb_maxs; ++ unsigned long s_stripe; + + /* history to debug policy */ + struct ext3_mb_history 
*s_mb_history; @@ -108,12 +109,16 @@ Index: linux-2.6.12.6/include/linux/ext3_fs_sb.h + unsigned long s_mb_buddies_generated; + unsigned long long s_mb_generation_time; }; ++ ++#define EXT3_GROUP_INFO(sb, group) \ ++ EXT3_SB(sb)->s_group_info[(group) >> EXT3_DESC_PER_BLOCK_BITS(sb)] \ ++ [(group) & (EXT3_DESC_PER_BLOCK(sb) - 1)] #endif /* _LINUX_EXT3_FS_SB */ -Index: linux-2.6.12.6/fs/ext3/super.c +Index: linux-2.6.12.6-bull/fs/ext3/super.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/super.c 2005-12-17 02:17:16.000000000 +0300 -+++ linux-2.6.12.6/fs/ext3/super.c 2005-12-17 02:21:21.000000000 +0300 +--- linux-2.6.12.6-bull.orig/fs/ext3/super.c 2006-04-29 20:39:09.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/super.c 2006-04-29 20:39:10.000000000 +0400 @@ -387,6 +387,7 @@ static void ext3_put_super (struct super struct ext3_super_block *es = sbi->s_es; int i; @@ -122,34 +127,45 @@ Index: linux-2.6.12.6/fs/ext3/super.c ext3_ext_release(sb); ext3_xattr_put_super(sb); journal_destroy(sbi->s_journal); -@@ -597,7 +598,7 @@ enum { - Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, +@@ -597,6 +598,7 @@ enum { Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -- Opt_extents, Opt_extdebug, -+ Opt_extents, Opt_extdebug, Opt_mballoc, + Opt_extents, Opt_noextents, Opt_extdebug, ++ Opt_mballoc, Opt_nomballoc, Opt_stripe, }; static match_table_t tokens = { -@@ -649,6 +651,7 @@ static match_table_t tokens = { - {Opt_iopen_nopriv, "iopen_nopriv"}, +@@ -650,6 +651,9 @@ static match_table_t tokens = { {Opt_extents, "extents"}, + {Opt_noextents, "noextents"}, {Opt_extdebug, "extdebug"}, + {Opt_mballoc, "mballoc"}, ++ {Opt_nomballoc, "nomballoc"}, ++ {Opt_stripe, "stripe=%u"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL}, {Opt_resize, "resize"}, -@@ -964,6 +967,9 @@ clear_qf_name: +@@ -965,6 +967,19 @@ clear_qf_name: case Opt_extdebug: set_opt (sbi->s_mount_opt, EXTDEBUG); break; + case Opt_mballoc: -+ 
set_opt (sbi->s_mount_opt, MBALLOC); ++ set_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_nomballoc: ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_stripe: ++ if (match_int(&args[0], &option)) ++ return 0; ++ if (option < 0) ++ return 0; ++ sbi->s_stripe = option; + break; default: printk (KERN_ERR "EXT3-fs: Unrecognized mount option \"%s\" " -@@ -1669,6 +1675,7 @@ static int ext3_fill_super (struct super +@@ -1670,6 +1675,7 @@ static int ext3_fill_super (struct super ext3_count_dirs(sb)); ext3_ext_init(sb); @@ -157,7 +173,7 @@ Index: linux-2.6.12.6/fs/ext3/super.c lock_kernel(); return 0; -@@ -2548,7 +2555,13 @@ static struct file_system_type ext3_fs_t +@@ -2549,7 +2555,13 @@ static struct file_system_type ext3_fs_t static int __init init_ext3_fs(void) { @@ -172,7 +188,7 @@ Index: linux-2.6.12.6/fs/ext3/super.c if (err) return err; err = init_inodecache(); -@@ -2570,6 +2583,7 @@ static void __exit exit_ext3_fs(void) +@@ -2571,6 +2583,7 @@ static void __exit exit_ext3_fs(void) unregister_filesystem(&ext3_fs_type); destroy_inodecache(); exit_ext3_xattr(); @@ -180,11 +196,11 @@ Index: linux-2.6.12.6/fs/ext3/super.c } int ext3_prep_san_write(struct inode *inode, long *blocks, -Index: linux-2.6.12.6/fs/ext3/extents.c +Index: linux-2.6.12.6-bull/fs/ext3/extents.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/extents.c 2005-12-17 02:17:16.000000000 +0300 -+++ linux-2.6.12.6/fs/ext3/extents.c 2005-12-17 02:21:21.000000000 +0300 -@@ -771,7 +771,7 @@ cleanup: +--- linux-2.6.12.6-bull.orig/fs/ext3/extents.c 2006-04-29 20:39:09.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/extents.c 2006-04-29 20:39:10.000000000 +0400 +@@ -777,7 +777,7 @@ cleanup: for (i = 0; i < depth; i++) { if (!ablocks[i]) continue; @@ -193,7 +209,7 @@ Index: linux-2.6.12.6/fs/ext3/extents.c } } kfree(ablocks); -@@ -1428,7 +1428,7 @@ int ext3_ext_rm_idx(handle_t *handle, st +@@ -1434,7 +1434,7 @@ int ext3_ext_rm_idx(handle_t 
*handle, st path->p_idx->ei_leaf); bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); @@ -202,7 +218,7 @@ Index: linux-2.6.12.6/fs/ext3/extents.c return err; } -@@ -1913,10 +1913,12 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1919,10 +1919,12 @@ ext3_remove_blocks(struct ext3_extents_t int needed = ext3_remove_blocks_credits(tree, ex, from, to); handle_t *handle = ext3_journal_start(tree->inode, needed); struct buffer_head *bh; @@ -216,7 +232,7 @@ Index: linux-2.6.12.6/fs/ext3/extents.c if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { /* tail removal */ unsigned long num, start; -@@ -1928,7 +1930,7 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1934,7 +1936,7 @@ ext3_remove_blocks(struct ext3_extents_t bh = sb_find_get_block(tree->inode->i_sb, start + i); ext3_forget(handle, 0, tree->inode, bh, start + i); } @@ -225,10 +241,10 @@ Index: linux-2.6.12.6/fs/ext3/extents.c } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", from, to, ex->ee_block, ex->ee_len); -Index: linux-2.6.12.6/fs/ext3/inode.c +Index: linux-2.6.12.6-bull/fs/ext3/inode.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/inode.c 2005-12-17 02:17:16.000000000 +0300 -+++ linux-2.6.12.6/fs/ext3/inode.c 2005-12-17 02:21:21.000000000 +0300 +--- linux-2.6.12.6-bull.orig/fs/ext3/inode.c 2006-04-29 20:39:09.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/inode.c 2006-04-29 20:39:10.000000000 +0400 @@ -564,7 +564,7 @@ static int ext3_alloc_branch(handle_t *h ext3_journal_forget(handle, branch[i].bh); } @@ -256,10 +272,10 @@ Index: linux-2.6.12.6/fs/ext3/inode.c if (parent_bh) { /* -Index: linux-2.6.12.6/fs/ext3/balloc.c +Index: linux-2.6.12.6-bull/fs/ext3/balloc.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/balloc.c 
2005-08-29 20:55:27.000000000 +0400 -+++ linux-2.6.12.6/fs/ext3/balloc.c 2005-12-17 02:21:21.000000000 +0300 +--- linux-2.6.12.6-bull.orig/fs/ext3/balloc.c 2005-08-29 20:55:27.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/balloc.c 2006-04-29 20:39:10.000000000 +0400 @@ -79,7 +79,7 @@ struct ext3_group_desc * ext3_get_group_ * * Return buffer_head on success or NULL in case of failure. @@ -303,10 +319,10 @@ Index: linux-2.6.12.6/fs/ext3/balloc.c unsigned long goal, int *errp) { struct buffer_head *bitmap_bh = NULL; -Index: linux-2.6.12.6/fs/ext3/xattr.c +Index: linux-2.6.12.6-bull/fs/ext3/xattr.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/xattr.c 2005-08-29 20:55:27.000000000 +0400 -+++ linux-2.6.12.6/fs/ext3/xattr.c 2005-12-17 02:21:33.000000000 +0300 +--- linux-2.6.12.6-bull.orig/fs/ext3/xattr.c 2005-08-29 20:55:27.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/xattr.c 2006-04-29 20:39:10.000000000 +0400 @@ -484,7 +484,7 @@ ext3_xattr_release_block(handle_t *handl ea_bdebug(bh, "refcount now=0; freeing"); if (ce) @@ -325,11 +341,11 @@ Index: linux-2.6.12.6/fs/ext3/xattr.c error = -EIO; goto cleanup; } -Index: linux-2.6.12.6/fs/ext3/mballoc.c +Index: linux-2.6.12.6-bull/fs/ext3/mballoc.c =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/mballoc.c 2005-12-09 13:08:53.191437750 +0300 -+++ linux-2.6.12.6/fs/ext3/mballoc.c 2005-12-17 02:21:21.000000000 +0300 -@@ -0,0 +1,2429 @@ +--- linux-2.6.12.6-bull.orig/fs/ext3/mballoc.c 2006-04-22 17:31:47.543334750 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/mballoc.c 2006-04-30 01:24:11.000000000 +0400 +@@ -0,0 +1,2701 @@ +/* + * Copyright (c) 2003-2005, Cluster File Systems, Inc, info@clusterfs.com + * Written by Alex Tomas @@ -418,6 +434,12 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + +long ext3_mb_stats = 1; + ++/* ++ * for which requests use 2^N search using buddies ++ */ ++long ext3_mb_order2_reqs = 8; ++ ++ +#ifdef 
EXT3_BB_MAX_BLOCKS +#undef EXT3_BB_MAX_BLOCKS +#endif @@ -458,10 +480,10 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + + /* search goals */ + struct ext3_free_extent ac_g_ex; -+ ++ + /* the best found extent */ + struct ext3_free_extent ac_b_ex; -+ ++ + /* number of iterations done. we have to track to limit searching */ + unsigned long ac_ex_scanned; + __u16 ac_groups_scanned; @@ -483,6 +505,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +struct ext3_mb_history { + struct ext3_free_extent goal; /* goal allocation */ + struct ext3_free_extent result; /* result allocation */ ++ unsigned pid; ++ unsigned ino; + __u16 found; /* how many extents have been found */ + __u16 groups; /* how many groups have been scanned */ + __u16 tail; /* what tail broke some buddy */ @@ -505,9 +529,9 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +#define EXT3_MB_BUDDY(e3b) ((e3b)->bd_buddy) + +#ifndef EXT3_MB_HISTORY -+#define ext3_mb_store_history(sb,ac) ++#define ext3_mb_store_history(sb,ino,ac) +#else -+static void ext3_mb_store_history(struct super_block *, ++static void ext3_mb_store_history(struct super_block *, unsigned ino, + struct ext3_allocation_context *ac); +#endif + @@ -626,7 +650,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + if (mb_check_counter++ % 300 != 0) + return; + } -+ ++ + while (order > 1) { + buddy = mb_find_buddy(e3b, order, &max); + J_ASSERT(buddy); @@ -807,7 +831,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + sb = inode->i_sb; + blocksize = 1 << inode->i_blkbits; + blocks_per_page = PAGE_CACHE_SIZE / blocksize; -+ ++ + groups_per_page = blocks_per_page >> 1; + if (groups_per_page == 0) + groups_per_page = 1; @@ -822,9 +846,9 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + memset(bh, 0, i); + } else + bh = &bhs; -+ ++ + first_group = page->index * blocks_per_page / 2; -+ ++ + /* read all groups the page covers into the cache */ + for (i = 0; i < groups_per_page; i++) { + struct ext3_group_desc * desc; @@ -879,11 +903,11 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + 
mb_debug("put buddy for group %u in page %lu/%x\n", + group, page->index, i * blocksize); + memset(data, 0xff, blocksize); -+ EXT3_SB(sb)->s_group_info[group]->bb_fragments = 0; -+ memset(EXT3_SB(sb)->s_group_info[group]->bb_counters, 0, ++ EXT3_GROUP_INFO(sb, group)->bb_fragments = 0; ++ memset(EXT3_GROUP_INFO(sb, group)->bb_counters, 0, + sizeof(unsigned short)*(sb->s_blocksize_bits+2)); + ext3_mb_generate_buddy(sb, data, bitmap, -+ EXT3_SB(sb)->s_group_info[group]); ++ EXT3_GROUP_INFO(sb, group)); + } else { + /* this is block of bitmap */ + mb_debug("put bitmap for group %u in page %lu/%x\n", @@ -916,7 +940,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; + + e3b->bd_blkbits = sb->s_blocksize_bits; -+ e3b->bd_info = sbi->s_group_info[group]; ++ e3b->bd_info = EXT3_GROUP_INFO(sb, group); + e3b->bd_sb = sb; + e3b->bd_group = group; + e3b->bd_buddy_page = NULL; @@ -992,14 +1016,14 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +ext3_lock_group(struct super_block *sb, int group) +{ + bit_spin_lock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static inline void +ext3_unlock_group(struct super_block *sb, int group) +{ + bit_spin_unlock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static int mb_find_order_for_block(struct ext3_buddy *e3b, int block) @@ -1129,7 +1153,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +static int mb_find_extent(struct ext3_buddy *e3b, int order, int block, + int needed, struct ext3_free_extent *ex) +{ -+ int next, max, ord; ++ int next = block, max, ord; + void *buddy; + + J_ASSERT(ex != NULL); @@ -1154,6 +1178,11 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + ex->fe_start = block << order; + ex->fe_group = e3b->bd_group; + ++ /* calc difference from given start */ ++ next = next - ex->fe_start; ++ ex->fe_len -= next; ++ ex->fe_start += 
next; ++ + while (needed > ex->fe_len && (buddy = mb_find_buddy(e3b, order, &max))) { + + if (block + 1 >= max) @@ -1349,7 +1378,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ex.fe_start, ex.fe_len, &ex); -+ ++ + if (max > 0) { + ac->ac_b_ex = ex; + ext3_mb_use_best_found(ac, e3b); @@ -1366,6 +1395,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + struct ext3_buddy *e3b) +{ + int group = ac->ac_g_ex.fe_group, max, err; ++ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); ++ struct ext3_super_block *es = sbi->s_es; + struct ext3_free_extent ex; + + err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); @@ -1374,9 +1405,27 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ac->ac_g_ex.fe_start, -+ ac->ac_g_ex.fe_len, &ex); -+ -+ if (max > 0) { ++ ac->ac_g_ex.fe_len, &ex); ++ ++ if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) { ++ unsigned long start; ++ start = (e3b->bd_group * EXT3_BLOCKS_PER_GROUP(ac->ac_sb) + ++ ex.fe_start + le32_to_cpu(es->s_first_data_block)); ++ if (start % sbi->s_stripe == 0) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ } else if (max >= ac->ac_g_ex.fe_len) { ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); ++ J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } else if (max > 0 && (ac->ac_flags & EXT3_MB_HINT_MERGE)) { ++ /* Sometimes, caller may want to merge even small ++ * number of blocks to an existing extent */ + J_ASSERT(ex.fe_len > 0); + J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); + J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); @@ -1404,7 +1453,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + int i, k, max; + + J_ASSERT(ac->ac_2order > 0); -+ for (i = ac->ac_2order; i < sb->s_blocksize_bits + 1; i++) { ++ for (i = ac->ac_2order; i <= 
sb->s_blocksize_bits + 1; i++) { + if (grp->bb_counters[i] == 0) + continue; + @@ -1469,11 +1518,46 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + } +} + ++/* ++ * This is a special case for storages like raid5 ++ * we try to find stripe-aligned chunks for stripe-size requests ++ */ ++static void ext3_mb_scan_aligned(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ void *bitmap = EXT3_MB_BITMAP(e3b); ++ struct ext3_free_extent ex; ++ unsigned long i, max; ++ ++ J_ASSERT(sbi->s_stripe != 0); ++ ++ /* find first stripe-aligned block */ ++ i = e3b->bd_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + le32_to_cpu(sbi->s_es->s_first_data_block); ++ i = ((i + sbi->s_stripe - 1) / sbi->s_stripe) * sbi->s_stripe; ++ i = (i - le32_to_cpu(sbi->s_es->s_first_data_block)) ++ % EXT3_BLOCKS_PER_GROUP(sb); ++ ++ while (i < sb->s_blocksize * 8) { ++ if (!mb_test_bit(i, bitmap)) { ++ max = mb_find_extent(e3b, 0, i, sbi->s_stripe, &ex); ++ if (max >= sbi->s_stripe) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ break; ++ } ++ } ++ i += sbi->s_stripe; ++ } ++} ++ +static int ext3_mb_good_group(struct ext3_allocation_context *ac, + int group, int cr) +{ -+ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); -+ struct ext3_group_info *grp = sbi->s_group_info[group]; ++ struct ext3_group_info *grp = EXT3_GROUP_INFO(ac->ac_sb, group); + unsigned free, fragments, i, bits; + + J_ASSERT(cr >= 0 && cr < 4); @@ -1490,15 +1574,18 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + case 0: + J_ASSERT(ac->ac_2order != 0); + bits = ac->ac_sb->s_blocksize_bits + 1; -+ for (i = ac->ac_2order; i < bits; i++) ++ for (i = ac->ac_2order; i <= bits; i++) + if (grp->bb_counters[i] > 0) + return 1; ++ break; + case 1: + if ((free / fragments) >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 2: + if (free >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 3: + return 1; + default: @@ -1599,23 
+1686,27 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + ac.ac_2order = 0; + ac.ac_criteria = 0; + ++ if (*len == 1 && sbi->s_stripe) { ++ /* looks like a metadata, let's use a dirty hack for raid5 ++ * move all metadata in first groups in hope to hit cached ++ * sectors and thus avoid read-modify cycles in raid5 */ ++ ac.ac_g_ex.fe_group = group = 0; ++ } ++ + /* probably, the request is for 2^8+ blocks (1/2/3/... MB) */ + i = ffs(*len); -+ if (i >= 8) { ++ if (i >= ext3_mb_order2_reqs) { + i--; + if ((*len & (~(1 << i))) == 0) + ac.ac_2order = i; + } + -+ /* Sometimes, caller may want to merge even small -+ * number of blocks to an existing extent */ -+ if (ac.ac_flags & EXT3_MB_HINT_MERGE) { -+ err = ext3_mb_find_by_goal(&ac, &e3b); -+ if (err) -+ goto out_err; -+ if (ac.ac_status == AC_STATUS_FOUND) -+ goto found; -+ } ++ /* first, try the goal */ ++ err = ext3_mb_find_by_goal(&ac, &e3b); ++ if (err) ++ goto out_err; ++ if (ac.ac_status == AC_STATUS_FOUND) ++ goto found; + + /* Let's just scan groups to find more-less suitable blocks */ + cr = ac.ac_2order ? 0 : 1; @@ -1626,7 +1717,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + if (group == EXT3_SB(sb)->s_groups_count) + group = 0; + -+ if (EXT3_MB_GRP_NEED_INIT(sbi->s_group_info[group])) { ++ if (EXT3_MB_GRP_NEED_INIT(EXT3_GROUP_INFO(sb, group))) { + /* we need full data about the group + * to make a good selection */ + err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); @@ -1654,6 +1745,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + ac.ac_groups_scanned++; + if (cr == 0) + ext3_mb_simple_scan_group(&ac, &e3b); ++ else if (cr == 1 && *len == sbi->s_stripe) ++ ext3_mb_scan_aligned(&ac, &e3b); + else + ext3_mb_complex_scan_group(&ac, &e3b); + @@ -1667,7 +1760,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + } + + if (ac.ac_b_ex.fe_len > 0 && ac.ac_status != AC_STATUS_FOUND && -+ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { ++ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { + /* + * We've been searching too long. 
Let's try to allocate + * the best chunk we've found so far @@ -1712,8 +1805,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + sbi->s_blocks_reserved, ac.ac_found); + printk("EXT3-fs: groups: "); + for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) -+ printk("%d: %d ", i, -+ sbi->s_group_info[i]->bb_free); ++ printk("%d: %d ", i, EXT3_GROUP_INFO(sb, i)->bb_free); + printk("\n"); +#endif + goto out; @@ -1751,7 +1843,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + *errp = -EIO; + goto out_err; + } -+ ++ + err = ext3_journal_get_write_access(handle, gdp_bh); + if (err) + goto out_err; @@ -1820,7 +1912,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + * path only, here is single block always */ + ext3_mb_release_blocks(sb, 1); + } -+ ++ + if (unlikely(ext3_mb_stats) && ac.ac_g_ex.fe_len > 1) { + atomic_inc(&sbi->s_bal_reqs); + atomic_add(*len, &sbi->s_bal_allocated); @@ -1834,7 +1926,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + atomic_inc(&sbi->s_bal_breaks); + } + -+ ext3_mb_store_history(sb, &ac); ++ ext3_mb_store_history(sb, inode->i_ino, &ac); + + return block; +} @@ -1899,9 +1991,9 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + char buf[20], buf2[20]; + + if (v == SEQ_START_TOKEN) { -+ seq_printf(seq, "%-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", -+ "goal", "result", "found", "grps", "cr", "merge", -+ "tail", "broken"); ++ seq_printf(seq, "%-5s %-8s %-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", ++ "pid", "inode", "goal", "result", "found", "grps", "cr", ++ "merge", "tail", "broken"); + return 0; + } + @@ -1909,9 +2001,9 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + hs->goal.fe_start, hs->goal.fe_len); + sprintf(buf2, "%u/%u/%u", hs->result.fe_group, + hs->result.fe_start, hs->result.fe_len); -+ seq_printf(seq, "%-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", buf, -+ buf2, hs->found, hs->groups, hs->cr, -+ hs->merged ? 
"M" : "", hs->tail, ++ seq_printf(seq, "%-5u %-8u %-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", ++ hs->pid, hs->ino, buf, buf2, hs->found, hs->groups, ++ hs->cr, hs->merged ? "M" : "", hs->tail, + hs->buddy ? 1 << hs->buddy : 0); + return 0; +} @@ -1945,7 +2037,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + s->max = sbi->s_mb_history_max; + s->start = sbi->s_mb_history_cur % s->max; + spin_unlock(&sbi->s_mb_history_lock); -+ ++ + rc = seq_open(file, &ext3_mb_seq_history_ops); + if (rc == 0) { + struct seq_file *m = (struct seq_file *)file->private_data; @@ -1969,10 +2061,104 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + +static struct file_operations ext3_mb_seq_history_fops = { + .owner = THIS_MODULE, -+ .open = ext3_mb_seq_history_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = ext3_mb_seq_history_release, ++ .open = ext3_mb_seq_history_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = ext3_mb_seq_history_release, ++}; ++ ++static void *ext3_mb_seq_groups_start(struct seq_file *seq, loff_t *pos) ++{ ++ struct super_block *sb = seq->private; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ long group; ++ ++ if (*pos < 0 || *pos >= sbi->s_groups_count) ++ return NULL; ++ ++ group = *pos + 1; ++ return (void *) group; ++} ++ ++static void *ext3_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos) ++{ ++ struct super_block *sb = seq->private; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ long group; ++ ++ ++*pos; ++ if (*pos < 0 || *pos >= sbi->s_groups_count) ++ return NULL; ++ group = *pos + 1; ++ return (void *) group;; ++} ++ ++static int ext3_mb_seq_groups_show(struct seq_file *seq, void *v) ++{ ++ struct super_block *sb = seq->private; ++ long group = (long) v, i; ++ struct sg { ++ struct ext3_group_info info; ++ unsigned short counters[16]; ++ } sg; ++ ++ group--; ++ if (group == 0) ++ seq_printf(seq, "#%-5s: %-5s %-5s %-5s [ %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n", ++ "group", "free", 
"frags", "first", "2^0", "2^1", "2^2", ++ "2^3", "2^4", "2^5", "2^6", "2^7", "2^8", "2^9", "2^10", ++ "2^11", "2^12", "2^13"); ++ ++ i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) + ++ sizeof(struct ext3_group_info); ++ ext3_lock_group(sb, group); ++ memcpy(&sg, EXT3_GROUP_INFO(sb, group), i); ++ ext3_unlock_group(sb, group); ++ ++ if (EXT3_MB_GRP_NEED_INIT(&sg.info)) ++ return 0; ++ ++ seq_printf(seq, "#%-5lu: %-5u %-5u %-5u [", group, sg.info.bb_free, ++ sg.info.bb_fragments, sg.info.bb_first_free); ++ for (i = 0; i <= 13; i++) ++ seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ? ++ sg.info.bb_counters[i] : 0); ++ seq_printf(seq, " ]\n"); ++ ++ return 0; ++} ++ ++static void ext3_mb_seq_groups_stop(struct seq_file *seq, void *v) ++{ ++} ++ ++static struct seq_operations ext3_mb_seq_groups_ops = { ++ .start = ext3_mb_seq_groups_start, ++ .next = ext3_mb_seq_groups_next, ++ .stop = ext3_mb_seq_groups_stop, ++ .show = ext3_mb_seq_groups_show, ++}; ++ ++static int ext3_mb_seq_groups_open(struct inode *inode, struct file *file) ++{ ++ struct super_block *sb = PDE(inode)->data; ++ int rc; ++ ++ rc = seq_open(file, &ext3_mb_seq_groups_ops); ++ if (rc == 0) { ++ struct seq_file *m = (struct seq_file *)file->private_data; ++ m->private = sb; ++ } ++ return rc; ++ ++} ++ ++static struct file_operations ext3_mb_seq_groups_fops = { ++ .owner = THIS_MODULE, ++ .open = ext3_mb_seq_groups_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = seq_release, +}; + +static void ext3_mb_history_release(struct super_block *sb) @@ -1981,6 +2167,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + char name[64]; + + snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ remove_proc_entry("mb_groups", sbi->s_mb_proc); + remove_proc_entry("mb_history", sbi->s_mb_proc); + remove_proc_entry(name, proc_root_ext3); + @@ -2003,6 +2190,11 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + p->proc_fops = &ext3_mb_seq_history_fops; + p->data = sb; + } ++ p = 
create_proc_entry("mb_groups", S_IRUGO, sbi->s_mb_proc); ++ if (p) { ++ p->proc_fops = &ext3_mb_seq_groups_fops; ++ p->data = sb; ++ } + } + + sbi->s_mb_history_max = 1000; @@ -2015,7 +2207,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +} + +static void -+ext3_mb_store_history(struct super_block *sb, struct ext3_allocation_context *ac) ++ext3_mb_store_history(struct super_block *sb, unsigned ino, ++ struct ext3_allocation_context *ac) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); + struct ext3_mb_history h; @@ -2023,6 +2216,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + if (likely(sbi->s_mb_history == NULL)) + return; + ++ h.pid = current->pid; ++ h.ino = ino; + h.goal = ac->ac_g_ex; + h.result = ac->ac_b_ex; + h.found = ac->ac_found; @@ -2050,21 +2245,40 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +int ext3_mb_init_backend(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i, len; -+ -+ len = sizeof(struct ext3_buddy_group_blocks *) * sbi->s_groups_count; -+ sbi->s_group_info = kmalloc(len, GFP_KERNEL); ++ int i, j, len, metalen; ++ int num_meta_group_infos = ++ (sbi->s_groups_count + EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ struct ext3_group_info **meta_group_info; ++ ++ /* An 8TB filesystem with 64-bit pointers requires a 4096 byte ++ * kmalloc. A 128kb malloc should suffice for a 256TB filesystem. ++ * So a two level scheme suffices for now. 
*/ ++ sbi->s_group_info = kmalloc(sizeof(*sbi->s_group_info) * ++ num_meta_group_infos, GFP_KERNEL); + if (sbi->s_group_info == NULL) { -+ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ printk(KERN_ERR "EXT3-fs: can't allocate buddy meta group\n"); + return -ENOMEM; + } -+ memset(sbi->s_group_info, 0, len); -+ + sbi->s_buddy_cache = new_inode(sb); + if (sbi->s_buddy_cache == NULL) { + printk(KERN_ERR "EXT3-fs: can't get new inode\n"); -+ kfree(sbi->s_group_info); -+ return -ENOMEM; ++ goto err_freesgi; ++ } ++ ++ metalen = sizeof(*meta_group_info) << EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) { ++ if ((i + 1) == num_meta_group_infos) ++ metalen = sizeof(*meta_group_info) * ++ (sbi->s_groups_count - ++ (i << EXT3_DESC_PER_BLOCK_BITS(sb))); ++ meta_group_info = kmalloc(metalen, GFP_KERNEL); ++ if (meta_group_info == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for a " ++ "buddy group\n"); ++ goto err_freemeta; ++ } ++ sbi->s_group_info[i] = meta_group_info; + } + + /* @@ -2076,30 +2290,42 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + for (i = 0; i < sbi->s_groups_count; i++) { + struct ext3_group_desc * desc; + -+ sbi->s_group_info[i] = kmalloc(len, GFP_KERNEL); -+ if (sbi->s_group_info[i] == NULL) { ++ meta_group_info = ++ sbi->s_group_info[i >> EXT3_DESC_PER_BLOCK_BITS(sb)]; ++ j = i & (EXT3_DESC_PER_BLOCK(sb) - 1); ++ ++ meta_group_info[j] = kmalloc(len, GFP_KERNEL); ++ if (meta_group_info[j] == NULL) { + printk(KERN_ERR "EXT3-fs: can't allocate buddy mem\n"); -+ goto err_out; ++ i--; ++ goto err_freebuddy; + } + desc = ext3_get_group_desc(sb, i, NULL); + if (desc == NULL) { + printk(KERN_ERR"EXT3-fs: can't read descriptor %u\n",i); -+ goto err_out; ++ goto err_freebuddy; + } -+ memset(sbi->s_group_info[i], 0, len); ++ memset(meta_group_info[j], 0, len); + set_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, -+ &sbi->s_group_info[i]->bb_state); -+ sbi->s_group_info[i]->bb_free = ++ &meta_group_info[j]->bb_state); ++ 
meta_group_info[j]->bb_free = + le16_to_cpu(desc->bg_free_blocks_count); + } + + return 0; + -+err_out: ++err_freebuddy: ++ while (i >= 0) { ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ i--; ++ } ++ i = num_meta_group_infos; ++err_freemeta: + while (--i >= 0) + kfree(sbi->s_group_info[i]); + iput(sbi->s_buddy_cache); -+ ++err_freesgi: ++ kfree(sbi->s_group_info); + return -ENOMEM; +} + @@ -2141,7 +2367,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + max = max >> 1; + i++; + } while (i <= sb->s_blocksize_bits + 1); -+ ++ + + /* init file for buddy data */ + if ((i = ext3_mb_init_backend(sb))) { @@ -2178,8 +2404,8 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c +int ext3_mb_release(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i; -+ ++ int i, num_meta_group_infos; ++ + if (!test_opt(sb, MBALLOC)) + return 0; + @@ -2193,11 +2419,13 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + ext3_mb_free_committed_blocks(sb); + + if (sbi->s_group_info) { -+ for (i = 0; i < sbi->s_groups_count; i++) { -+ if (sbi->s_group_info[i] == NULL) -+ continue; ++ for (i = 0; i < sbi->s_groups_count; i++) ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ num_meta_group_infos = (sbi->s_groups_count + ++ EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) + kfree(sbi->s_group_info[i]); -+ } + kfree(sbi->s_group_info); + } + if (sbi->s_mb_offsets) @@ -2491,7 +2719,7 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count); + spin_unlock(sb_bgl_lock(sbi, block_group)); + percpu_counter_mod(&sbi->s_freeblocks_counter, count); -+ ++ + ext3_mb_release_desc(&e3b); + + *freed = count; @@ -2574,10 +2802,11 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + return; +} + -+#define EXT3_ROOT "ext3" -+#define EXT3_MB_STATS_NAME "mb_stats" ++#define EXT3_ROOT "ext3" ++#define EXT3_MB_STATS_NAME "mb_stats" +#define EXT3_MB_MAX_TO_SCAN_NAME "mb_max_to_scan" +#define EXT3_MB_MIN_TO_SCAN_NAME "mb_min_to_scan" 
++#define EXT3_MB_ORDER2_REQ "mb_order2_req"
+
+static int ext3_mb_stats_read(char *page, char **start, off_t off,
+			      int count, int *eof, void *data)
@@ -2665,6 +2894,45 @@
+	return len;
+}
+
++static int ext3_mb_order2_req_write(struct file *file, const char *buffer,
++				    unsigned long count, void *data)
++{
++	char str[32];
++	long value;
++
++	if (count >= sizeof(str)) {
++		printk(KERN_ERR "EXT3-fs: %s string too long, max %u bytes\n",
++		       EXT3_MB_ORDER2_REQ, (int)sizeof(str));
++		return -EOVERFLOW;
++	}
++
++	if (copy_from_user(str, buffer, count))
++		return -EFAULT;
++
++	/* the value must be positive */
++	value = simple_strtol(str, NULL, 0);
++	if (value <= 0)
++		return -ERANGE;
++
++	ext3_mb_order2_reqs = value;
++
++	return count;
++}
++
++static int ext3_mb_order2_req_read(char *page, char **start, off_t off,
++				   int count, int *eof, void *data)
++{
++	int len;
++
++	*eof = 1;
++	if (off != 0)
++		return 0;
++
++	len = sprintf(page, "%ld\n", ext3_mb_order2_reqs);
++	*start = page;
++	return len;
++}
++
+static int ext3_mb_min_to_scan_write(struct file *file, const char *buffer,
+				     unsigned long count, void *data)
+{
@@ -2695,6 +2963,7 @@
+	struct proc_dir_entry *proc_ext3_mb_stats;
+	struct proc_dir_entry *proc_ext3_mb_max_to_scan;
+	struct proc_dir_entry *proc_ext3_mb_min_to_scan;
++	struct proc_dir_entry *proc_ext3_mb_order2_req;
+
+	proc_root_ext3 = proc_mkdir(EXT3_ROOT, proc_root_fs);
+	if (proc_root_ext3 == NULL) {
@@ -2749,6 +3018,24 @@
+	proc_ext3_mb_min_to_scan->read_proc = ext3_mb_min_to_scan_read;
+	proc_ext3_mb_min_to_scan->write_proc = ext3_mb_min_to_scan_write;
+
++	/* Initialize EXT3_ORDER2_REQ */
++	proc_ext3_mb_order2_req = create_proc_entry(
++			EXT3_MB_ORDER2_REQ,
++			S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3);
++	if (proc_ext3_mb_order2_req == NULL) {
++		printk(KERN_ERR "EXT3-fs: Unable to
create %s\n", ++ EXT3_MB_ORDER2_REQ); ++ remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_order2_req->data = NULL; ++ proc_ext3_mb_order2_req->read_proc = ext3_mb_order2_req_read; ++ proc_ext3_mb_order2_req->write_proc = ext3_mb_order2_req_write; ++ + return 0; +} + @@ -2757,13 +3044,14 @@ Index: linux-2.6.12.6/fs/ext3/mballoc.c + remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_ORDER2_REQ, proc_root_ext3); + remove_proc_entry(EXT3_ROOT, proc_root_fs); +} -Index: linux-2.6.12.6/fs/ext3/Makefile +Index: linux-2.6.12.6-bull/fs/ext3/Makefile =================================================================== ---- linux-2.6.12.6.orig/fs/ext3/Makefile 2005-12-17 02:17:16.000000000 +0300 -+++ linux-2.6.12.6/fs/ext3/Makefile 2005-12-17 02:21:21.000000000 +0300 -@@ -6,7 +6,7 @@ +--- linux-2.6.12.6-bull.orig/fs/ext3/Makefile 2006-04-29 20:39:09.000000000 +0400 ++++ linux-2.6.12.6-bull/fs/ext3/Makefile 2006-04-29 20:39:10.000000000 +0400 +@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT3_FS) += ext3.o ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ ioctl.o namei.o super.o symlink.o hash.o resize.o \ diff --git a/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.18-vanilla.patch b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.18-vanilla.patch new file mode 100644 index 0000000..0040a6f --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-mballoc2-2.6.18-vanilla.patch @@ -0,0 +1,2810 @@ +Index: linux-stage/fs/ext3/mballoc.c +=================================================================== +--- /dev/null 1970-01-01 00:00:00.000000000 +0000 ++++ linux-stage/fs/ext3/mballoc.c 
2006-07-16 02:29:49.000000000 +0800
+@@ -0,0 +1,2434 @@
++/*
++ * Copyright (c) 2003-2005, Cluster File Systems, Inc, info@clusterfs.com
++ * Written by Alex Tomas <alex@clusterfs.com>
++ *
++ * This program is free software; you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License version 2 as
++ * published by the Free Software Foundation.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public Licens
++ * along with this program; if not, write to the Free Software
++ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
++ */
++
++
++/*
++ * mballoc.c contains the multiblocks allocation routines
++ */
++
++#include <linux/config.h>
++#include <linux/time.h>
++#include <linux/fs.h>
++#include <linux/namei.h>
++#include <linux/ext3_jbd.h>
++#include <linux/jbd.h>
++#include <linux/ext3_fs.h>
++#include <linux/quotaops.h>
++#include <linux/buffer_head.h>
++#include <linux/module.h>
++#include <linux/swap.h>
++#include <linux/proc_fs.h>
++#include <linux/pagemap.h>
++#include <linux/seq_file.h>
++
++/*
++ * TODO:
++ *  - bitmap read-ahead (proposed by Oleg Drokin aka green)
++ *  - track min/max extents in each group for better group selection
++ *  - mb_mark_used() may allocate chunk right after splitting buddy
++ *  - special flag to advise allocator to look for requested + N blocks
++ *    this may improve interaction between extents and mballoc
++ *  - tree of groups sorted by number of free blocks
++ *  - percpu reservation code (hotpath)
++ *  - error handling
++ */
++
++/*
++ * with AGGRESSIVE_CHECK allocator runs consistency checks over
++ * structures. these checks slow things down a lot
++ */
++#define AGGRESSIVE_CHECK__
++
++/*
++ */
++#define MB_DEBUG__
++#ifdef MB_DEBUG
++#define mb_debug(fmt,a...)	printk(fmt, ##a)
++#else
++#define mb_debug(fmt,a...)
++#endif
++
++/*
++ * with EXT3_MB_HISTORY mballoc stores last N allocations in memory
++ * and you can monitor it in /proc/fs/ext3/<dev>/mb_history
++ */
++#define EXT3_MB_HISTORY
++
++/*
++ * How long mballoc can look for a best extent (in found extents)
++ */
++long ext3_mb_max_to_scan = 500;
++
++/*
++ * How long mballoc must look for a best extent
++ */
++long ext3_mb_min_to_scan = 30;
++
++/*
++ * with 'ext3_mb_stats' allocator will collect stats that will be
++ * shown at umount. The collecting costs though!
++ */
++
++long ext3_mb_stats = 1;
++
++#ifdef EXT3_BB_MAX_BLOCKS
++#undef EXT3_BB_MAX_BLOCKS
++#endif
++#define EXT3_BB_MAX_BLOCKS	30
++
++struct ext3_free_metadata {
++	unsigned short group;
++	unsigned short num;
++	unsigned short blocks[EXT3_BB_MAX_BLOCKS];
++	struct list_head list;
++};
++
++struct ext3_group_info {
++	unsigned long	bb_state;
++	unsigned long	bb_tid;
++	struct ext3_free_metadata *bb_md_cur;
++	unsigned short	bb_first_free;
++	unsigned short	bb_free;
++	unsigned short	bb_fragments;
++	unsigned short	bb_counters[];
++};
++
++
++#define EXT3_GROUP_INFO_NEED_INIT_BIT	0
++#define EXT3_GROUP_INFO_LOCKED_BIT	1
++
++#define EXT3_MB_GRP_NEED_INIT(grp)	\
++	(test_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, &(grp)->bb_state))
++
++struct ext3_free_extent {
++	__u16 fe_start;
++	__u16 fe_len;
++	__u16 fe_group;
++};
++
++struct ext3_allocation_context {
++	struct super_block *ac_sb;
++
++	/* search goals */
++	struct ext3_free_extent ac_g_ex;
++
++	/* the best found extent */
++	struct ext3_free_extent ac_b_ex;
++
++	/* number of iterations done.
we have to track to limit searching */ ++ unsigned long ac_ex_scanned; ++ __u16 ac_groups_scanned; ++ __u16 ac_found; ++ __u16 ac_tail; ++ __u16 ac_buddy; ++ __u8 ac_status; ++ __u8 ac_flags; /* allocation hints */ ++ __u8 ac_criteria; ++ __u8 ac_repeats; ++ __u8 ac_2order; /* if request is to allocate 2^N blocks and ++ * N > 0, the field stores N, otherwise 0 */ ++}; ++ ++#define AC_STATUS_CONTINUE 1 ++#define AC_STATUS_FOUND 2 ++#define AC_STATUS_BREAK 3 ++ ++struct ext3_mb_history { ++ struct ext3_free_extent goal; /* goal allocation */ ++ struct ext3_free_extent result; /* result allocation */ ++ __u16 found; /* how many extents have been found */ ++ __u16 groups; /* how many groups have been scanned */ ++ __u16 tail; /* what tail broke some buddy */ ++ __u16 buddy; /* buddy the tail ^^^ broke */ ++ __u8 cr; /* which phase the result extent was found at */ ++ __u8 merged; ++}; ++ ++struct ext3_buddy { ++ struct page *bd_buddy_page; ++ void *bd_buddy; ++ struct page *bd_bitmap_page; ++ void *bd_bitmap; ++ struct ext3_group_info *bd_info; ++ struct super_block *bd_sb; ++ __u16 bd_blkbits; ++ __u16 bd_group; ++}; ++#define EXT3_MB_BITMAP(e3b) ((e3b)->bd_bitmap) ++#define EXT3_MB_BUDDY(e3b) ((e3b)->bd_buddy) ++ ++#ifndef EXT3_MB_HISTORY ++#define ext3_mb_store_history(sb,ac) ++#else ++static void ext3_mb_store_history(struct super_block *, ++ struct ext3_allocation_context *ac); ++#endif ++ ++#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1) ++ ++static struct proc_dir_entry *proc_root_ext3; ++ ++int ext3_create (struct inode *, struct dentry *, int, struct nameidata *); ++struct buffer_head * read_block_bitmap(struct super_block *, unsigned int); ++int ext3_new_block_old(handle_t *, struct inode *, unsigned long, int *); ++int ext3_mb_reserve_blocks(struct super_block *, int); ++void ext3_mb_release_blocks(struct super_block *, int); ++void ext3_mb_poll_new_transaction(struct super_block *, handle_t *); ++void 
ext3_mb_free_committed_blocks(struct super_block *); ++ ++#if BITS_PER_LONG == 64 ++#define mb_correct_addr_and_bit(bit,addr) \ ++{ \ ++ bit += ((unsigned long) addr & 7UL) << 3; \ ++ addr = (void *) ((unsigned long) addr & ~7UL); \ ++} ++#elif BITS_PER_LONG == 32 ++#define mb_correct_addr_and_bit(bit,addr) \ ++{ \ ++ bit += ((unsigned long) addr & 3UL) << 3; \ ++ addr = (void *) ((unsigned long) addr & ~3UL); \ ++} ++#else ++#error "how many bits you are?!" ++#endif ++ ++static inline int mb_test_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ return ext2_test_bit(bit, addr); ++} ++ ++static inline void mb_set_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_set_bit(bit, addr); ++} ++ ++static inline void mb_set_bit_atomic(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_set_bit_atomic(NULL, bit, addr); ++} ++ ++static inline void mb_clear_bit(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_clear_bit(bit, addr); ++} ++ ++static inline void mb_clear_bit_atomic(int bit, void *addr) ++{ ++ mb_correct_addr_and_bit(bit,addr); ++ ext2_clear_bit_atomic(NULL, bit, addr); ++} ++ ++static inline int mb_find_next_zero_bit(void *addr, int max, int start) ++{ ++ int fix; ++#if BITS_PER_LONG == 64 ++ fix = ((unsigned long) addr & 7UL) << 3; ++ addr = (void *) ((unsigned long) addr & ~7UL); ++#elif BITS_PER_LONG == 32 ++ fix = ((unsigned long) addr & 3UL) << 3; ++ addr = (void *) ((unsigned long) addr & ~3UL); ++#else ++#error "how many bits you are?!" 
++#endif ++ max += fix; ++ start += fix; ++ return ext2_find_next_zero_bit(addr, max, start) - fix; ++} ++ ++static inline void *mb_find_buddy(struct ext3_buddy *e3b, int order, int *max) ++{ ++ char *bb; ++ ++ J_ASSERT(EXT3_MB_BITMAP(e3b) != EXT3_MB_BUDDY(e3b)); ++ J_ASSERT(max != NULL); ++ ++ if (order > e3b->bd_blkbits + 1) { ++ *max = 0; ++ return NULL; ++ } ++ ++ /* at order 0 we see each particular block */ ++ *max = 1 << (e3b->bd_blkbits + 3); ++ if (order == 0) ++ return EXT3_MB_BITMAP(e3b); ++ ++ bb = EXT3_MB_BUDDY(e3b) + EXT3_SB(e3b->bd_sb)->s_mb_offsets[order]; ++ *max = EXT3_SB(e3b->bd_sb)->s_mb_maxs[order]; ++ ++ return bb; ++} ++ ++#ifdef AGGRESSIVE_CHECK ++ ++static void mb_check_buddy(struct ext3_buddy *e3b) ++{ ++ int order = e3b->bd_blkbits + 1; ++ int max, max2, i, j, k, count; ++ int fragments = 0, fstart; ++ void *buddy, *buddy2; ++ ++ if (!test_opt(e3b->bd_sb, MBALLOC)) ++ return; ++ ++ { ++ static int mb_check_counter = 0; ++ if (mb_check_counter++ % 300 != 0) ++ return; ++ } ++ ++ while (order > 1) { ++ buddy = mb_find_buddy(e3b, order, &max); ++ J_ASSERT(buddy); ++ buddy2 = mb_find_buddy(e3b, order - 1, &max2); ++ J_ASSERT(buddy2); ++ J_ASSERT(buddy != buddy2); ++ J_ASSERT(max * 2 == max2); ++ ++ count = 0; ++ for (i = 0; i < max; i++) { ++ ++ if (mb_test_bit(i, buddy)) { ++ /* only single bit in buddy2 may be 1 */ ++ if (!mb_test_bit(i << 1, buddy2)) ++ J_ASSERT(mb_test_bit((i<<1)+1, buddy2)); ++ else if (!mb_test_bit((i << 1) + 1, buddy2)) ++ J_ASSERT(mb_test_bit(i << 1, buddy2)); ++ continue; ++ } ++ ++ /* both bits in buddy2 must be 0 */ ++ J_ASSERT(mb_test_bit(i << 1, buddy2)); ++ J_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); ++ ++ for (j = 0; j < (1 << order); j++) { ++ k = (i * (1 << order)) + j; ++ J_ASSERT(!mb_test_bit(k, EXT3_MB_BITMAP(e3b))); ++ } ++ count++; ++ } ++ J_ASSERT(e3b->bd_info->bb_counters[order] == count); ++ order--; ++ } ++ ++ fstart = -1; ++ buddy = mb_find_buddy(e3b, 0, &max); ++ for (i = 0; i < max; i++) { ++ if 
(!mb_test_bit(i, buddy)) { ++ J_ASSERT(i >= e3b->bd_info->bb_first_free); ++ if (fstart == -1) { ++ fragments++; ++ fstart = i; ++ } ++ continue; ++ } ++ fstart = -1; ++ /* check used bits only */ ++ for (j = 0; j < e3b->bd_blkbits + 1; j++) { ++ buddy2 = mb_find_buddy(e3b, j, &max2); ++ k = i >> j; ++ J_ASSERT(k < max2); ++ J_ASSERT(mb_test_bit(k, buddy2)); ++ } ++ } ++ J_ASSERT(!EXT3_MB_GRP_NEED_INIT(e3b->bd_info)); ++ J_ASSERT(e3b->bd_info->bb_fragments == fragments); ++} ++ ++#else ++#define mb_check_buddy(e3b) ++#endif ++ ++/* find most significant bit */ ++static int inline fmsb(unsigned short word) ++{ ++ int order; ++ ++ if (word > 255) { ++ order = 7; ++ word >>= 8; ++ } else { ++ order = -1; ++ } ++ ++ do { ++ order++; ++ word >>= 1; ++ } while (word != 0); ++ ++ return order; ++} ++ ++static void inline ++ext3_mb_mark_free_simple(struct super_block *sb, void *buddy, unsigned first, ++ int len, struct ext3_group_info *grp) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ unsigned short min, max, chunk, border; ++ ++ mb_debug("mark %u/%u free\n", first, len); ++ J_ASSERT(len < EXT3_BLOCKS_PER_GROUP(sb)); ++ ++ border = 2 << sb->s_blocksize_bits; ++ ++ while (len > 0) { ++ /* find how many blocks can be covered since this position */ ++ max = ffs(first | border) - 1; ++ ++ /* find how many blocks of power 2 we need to mark */ ++ min = fmsb(len); ++ ++ mb_debug(" %u/%u -> max %u, min %u\n", ++ first & ((2 << sb->s_blocksize_bits) - 1), ++ len, max, min); ++ ++ if (max < min) ++ min = max; ++ chunk = 1 << min; ++ ++ /* mark multiblock chunks only */ ++ grp->bb_counters[min]++; ++ if (min > 0) { ++ mb_debug(" set %u at %u \n", first >> min, ++ sbi->s_mb_offsets[min]); ++ mb_clear_bit(first >> min, buddy + sbi->s_mb_offsets[min]); ++ } ++ ++ len -= chunk; ++ first += chunk; ++ } ++} ++ ++static void ++ext3_mb_generate_buddy(struct super_block *sb, void *buddy, void *bitmap, ++ struct ext3_group_info *grp) ++{ ++ unsigned short max = EXT3_BLOCKS_PER_GROUP(sb); 
++ unsigned short i = 0, first, len; ++ unsigned free = 0, fragments = 0; ++ unsigned long long period = get_cycles(); ++ ++ i = mb_find_next_zero_bit(bitmap, max, 0); ++ grp->bb_first_free = i; ++ while (i < max) { ++ fragments++; ++ first = i; ++ i = find_next_bit(bitmap, max, i); ++ len = i - first; ++ free += len; ++ if (len > 1) ++ ext3_mb_mark_free_simple(sb, buddy, first, len, grp); ++ else ++ grp->bb_counters[0]++; ++ if (i < max) ++ i = mb_find_next_zero_bit(bitmap, max, i); ++ } ++ grp->bb_fragments = fragments; ++ ++ /* bb_state shouldn't being modified because all ++ * others waits for init completion on page lock */ ++ clear_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, &grp->bb_state); ++ if (free != grp->bb_free) { ++ printk("EXT3-fs: %u blocks in bitmap, %u in group descriptor\n", ++ free, grp->bb_free); ++ grp->bb_free = free; ++ } ++ ++ period = get_cycles() - period; ++ spin_lock(&EXT3_SB(sb)->s_bal_lock); ++ EXT3_SB(sb)->s_mb_buddies_generated++; ++ EXT3_SB(sb)->s_mb_generation_time += period; ++ spin_unlock(&EXT3_SB(sb)->s_bal_lock); ++} ++ ++static int ext3_mb_init_cache(struct page *page) ++{ ++ int blocksize, blocks_per_page, groups_per_page; ++ int err = 0, i, first_group, first_block; ++ struct super_block *sb; ++ struct buffer_head *bhs; ++ struct buffer_head **bh; ++ struct inode *inode; ++ char *data, *bitmap; ++ ++ mb_debug("init page %lu\n", page->index); ++ ++ inode = page->mapping->host; ++ sb = inode->i_sb; ++ blocksize = 1 << inode->i_blkbits; ++ blocks_per_page = PAGE_CACHE_SIZE / blocksize; ++ ++ groups_per_page = blocks_per_page >> 1; ++ if (groups_per_page == 0) ++ groups_per_page = 1; ++ ++ /* allocate buffer_heads to read bitmaps */ ++ if (groups_per_page > 1) { ++ err = -ENOMEM; ++ i = sizeof(struct buffer_head *) * groups_per_page; ++ bh = kmalloc(i, GFP_NOFS); ++ if (bh == NULL) ++ goto out; ++ memset(bh, 0, i); ++ } else ++ bh = &bhs; ++ ++ first_group = page->index * blocks_per_page / 2; ++ ++ /* read all groups the page covers 
into the cache */ ++ for (i = 0; i < groups_per_page; i++) { ++ struct ext3_group_desc * desc; ++ ++ if (first_group + i >= EXT3_SB(sb)->s_groups_count) ++ break; ++ ++ err = -EIO; ++ desc = ext3_get_group_desc(sb, first_group + i, NULL); ++ if (desc == NULL) ++ goto out; ++ ++ err = -ENOMEM; ++ bh[i] = sb_getblk(sb, le32_to_cpu(desc->bg_block_bitmap)); ++ if (bh[i] == NULL) ++ goto out; ++ ++ if (buffer_uptodate(bh[i])) ++ continue; ++ ++ lock_buffer(bh[i]); ++ if (buffer_uptodate(bh[i])) { ++ unlock_buffer(bh[i]); ++ continue; ++ } ++ ++ get_bh(bh[i]); ++ bh[i]->b_end_io = end_buffer_read_sync; ++ submit_bh(READ, bh[i]); ++ mb_debug("read bitmap for group %u\n", first_group + i); ++ } ++ ++ /* wait for I/O completion */ ++ for (i = 0; i < groups_per_page && bh[i]; i++) ++ wait_on_buffer(bh[i]); ++ ++ /* XXX: I/O error handling here */ ++ ++ first_block = page->index * blocks_per_page; ++ for (i = 0; i < blocks_per_page; i++) { ++ int group; ++ ++ group = (first_block + i) >> 1; ++ if (group >= EXT3_SB(sb)->s_groups_count) ++ break; ++ ++ data = page_address(page) + (i * blocksize); ++ bitmap = bh[group - first_group]->b_data; ++ ++ if ((first_block + i) & 1) { ++ /* this is block of buddy */ ++ mb_debug("put buddy for group %u in page %lu/%x\n", ++ group, page->index, i * blocksize); ++ memset(data, 0xff, blocksize); ++ EXT3_SB(sb)->s_group_info[group]->bb_fragments = 0; ++ memset(EXT3_SB(sb)->s_group_info[group]->bb_counters, 0, ++ sizeof(unsigned short)*(sb->s_blocksize_bits+2)); ++ ext3_mb_generate_buddy(sb, data, bitmap, ++ EXT3_SB(sb)->s_group_info[group]); ++ } else { ++ /* this is block of bitmap */ ++ mb_debug("put bitmap for group %u in page %lu/%x\n", ++ group, page->index, i * blocksize); ++ memcpy(data, bitmap, blocksize); ++ } ++ } ++ SetPageUptodate(page); ++ ++out: ++ for (i = 0; i < groups_per_page && bh[i]; i++) ++ brelse(bh[i]); ++ if (bh && bh != &bhs) ++ kfree(bh); ++ return err; ++} ++ ++static int ext3_mb_load_buddy(struct super_block *sb, 
int group, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct inode *inode = sbi->s_buddy_cache; ++ int blocks_per_page, block, pnum, poff; ++ struct page *page; ++ ++ mb_debug("load group %u\n", group); ++ ++ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; ++ ++ e3b->bd_blkbits = sb->s_blocksize_bits; ++ e3b->bd_info = sbi->s_group_info[group]; ++ e3b->bd_sb = sb; ++ e3b->bd_group = group; ++ e3b->bd_buddy_page = NULL; ++ e3b->bd_bitmap_page = NULL; ++ ++ block = group * 2; ++ pnum = block / blocks_per_page; ++ poff = block % blocks_per_page; ++ ++ page = find_get_page(inode->i_mapping, pnum); ++ if (page == NULL || !PageUptodate(page)) { ++ if (page) ++ page_cache_release(page); ++ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); ++ if (page) { ++ if (!PageUptodate(page)) ++ ext3_mb_init_cache(page); ++ unlock_page(page); ++ } ++ } ++ if (page == NULL || !PageUptodate(page)) ++ goto err; ++ e3b->bd_bitmap_page = page; ++ e3b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); ++ mark_page_accessed(page); ++ ++ block++; ++ pnum = block / blocks_per_page; ++ poff = block % blocks_per_page; ++ ++ page = find_get_page(inode->i_mapping, pnum); ++ if (page == NULL || !PageUptodate(page)) { ++ if (page) ++ page_cache_release(page); ++ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); ++ if (page) { ++ if (!PageUptodate(page)) ++ ext3_mb_init_cache(page); ++ unlock_page(page); ++ } ++ } ++ if (page == NULL || !PageUptodate(page)) ++ goto err; ++ e3b->bd_buddy_page = page; ++ e3b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); ++ mark_page_accessed(page); ++ ++ J_ASSERT(e3b->bd_bitmap_page != NULL); ++ J_ASSERT(e3b->bd_buddy_page != NULL); ++ ++ return 0; ++ ++err: ++ if (e3b->bd_bitmap_page) ++ page_cache_release(e3b->bd_bitmap_page); ++ if (e3b->bd_buddy_page) ++ page_cache_release(e3b->bd_buddy_page); ++ e3b->bd_buddy = NULL; ++ e3b->bd_bitmap = NULL; ++ return -EIO; ++} ++ ++static void 
ext3_mb_release_desc(struct ext3_buddy *e3b) ++{ ++ if (e3b->bd_bitmap_page) ++ page_cache_release(e3b->bd_bitmap_page); ++ if (e3b->bd_buddy_page) ++ page_cache_release(e3b->bd_buddy_page); ++} ++ ++ ++static inline void ++ext3_lock_group(struct super_block *sb, int group) ++{ ++ bit_spin_lock(EXT3_GROUP_INFO_LOCKED_BIT, ++ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++} ++ ++static inline void ++ext3_unlock_group(struct super_block *sb, int group) ++{ ++ bit_spin_unlock(EXT3_GROUP_INFO_LOCKED_BIT, ++ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++} ++ ++static int mb_find_order_for_block(struct ext3_buddy *e3b, int block) ++{ ++ int order = 1; ++ void *bb; ++ ++ J_ASSERT(EXT3_MB_BITMAP(e3b) != EXT3_MB_BUDDY(e3b)); ++ J_ASSERT(block < (1 << (e3b->bd_blkbits + 3))); ++ ++ bb = EXT3_MB_BUDDY(e3b); ++ while (order <= e3b->bd_blkbits + 1) { ++ block = block >> 1; ++ if (!mb_test_bit(block, bb)) { ++ /* this block is part of buddy of order 'order' */ ++ return order; ++ } ++ bb += 1 << (e3b->bd_blkbits - order); ++ order++; ++ } ++ return 0; ++} ++ ++static inline void mb_clear_bits(void *bm, int cur, int len) ++{ ++ __u32 *addr; ++ ++ len = cur + len; ++ while (cur < len) { ++ if ((cur & 31) == 0 && (len - cur) >= 32) { ++ /* fast path: clear whole word at once */ ++ addr = bm + (cur >> 3); ++ *addr = 0; ++ cur += 32; ++ continue; ++ } ++ mb_clear_bit_atomic(cur, bm); ++ cur++; ++ } ++} ++ ++static inline void mb_set_bits(void *bm, int cur, int len) ++{ ++ __u32 *addr; ++ ++ len = cur + len; ++ while (cur < len) { ++ if ((cur & 31) == 0 && (len - cur) >= 32) { ++ /* fast path: clear whole word at once */ ++ addr = bm + (cur >> 3); ++ *addr = 0xffffffff; ++ cur += 32; ++ continue; ++ } ++ mb_set_bit_atomic(cur, bm); ++ cur++; ++ } ++} ++ ++static int mb_free_blocks(struct ext3_buddy *e3b, int first, int count) ++{ ++ int block = 0, max = 0, order; ++ void *buddy, *buddy2; ++ ++ mb_check_buddy(e3b); ++ ++ e3b->bd_info->bb_free += count; ++ if (first < 
e3b->bd_info->bb_first_free) ++ e3b->bd_info->bb_first_free = first; ++ ++ /* let's maintain fragments counter */ ++ if (first != 0) ++ block = !mb_test_bit(first - 1, EXT3_MB_BITMAP(e3b)); ++ if (first + count < EXT3_SB(e3b->bd_sb)->s_mb_maxs[0]) ++ max = !mb_test_bit(first + count, EXT3_MB_BITMAP(e3b)); ++ if (block && max) ++ e3b->bd_info->bb_fragments--; ++ else if (!block && !max) ++ e3b->bd_info->bb_fragments++; ++ ++ /* let's maintain buddy itself */ ++ while (count-- > 0) { ++ block = first++; ++ order = 0; ++ ++ J_ASSERT(mb_test_bit(block, EXT3_MB_BITMAP(e3b))); ++ mb_clear_bit(block, EXT3_MB_BITMAP(e3b)); ++ e3b->bd_info->bb_counters[order]++; ++ ++ /* start of the buddy */ ++ buddy = mb_find_buddy(e3b, order, &max); ++ ++ do { ++ block &= ~1UL; ++ if (mb_test_bit(block, buddy) || ++ mb_test_bit(block + 1, buddy)) ++ break; ++ ++ /* both the buddies are free, try to coalesce them */ ++ buddy2 = mb_find_buddy(e3b, order + 1, &max); ++ ++ if (!buddy2) ++ break; ++ ++ if (order > 0) { ++ /* for special purposes, we don't set ++ * free bits in bitmap */ ++ mb_set_bit(block, buddy); ++ mb_set_bit(block + 1, buddy); ++ } ++ e3b->bd_info->bb_counters[order]--; ++ e3b->bd_info->bb_counters[order]--; ++ ++ block = block >> 1; ++ order++; ++ e3b->bd_info->bb_counters[order]++; ++ ++ mb_clear_bit(block, buddy2); ++ buddy = buddy2; ++ } while (1); ++ } ++ mb_check_buddy(e3b); ++ ++ return 0; ++} ++ ++static int mb_find_extent(struct ext3_buddy *e3b, int order, int block, ++ int needed, struct ext3_free_extent *ex) ++{ ++ int next, max, ord; ++ void *buddy; ++ ++ J_ASSERT(ex != NULL); ++ ++ buddy = mb_find_buddy(e3b, order, &max); ++ J_ASSERT(buddy); ++ J_ASSERT(block < max); ++ if (mb_test_bit(block, buddy)) { ++ ex->fe_len = 0; ++ ex->fe_start = 0; ++ ex->fe_group = 0; ++ return 0; ++ } ++ ++ if (likely(order == 0)) { ++ /* find actual order */ ++ order = mb_find_order_for_block(e3b, block); ++ block = block >> order; ++ } ++ ++ ex->fe_len = 1 << order; ++ 
ex->fe_start = block << order; ++ ex->fe_group = e3b->bd_group; ++ ++ while (needed > ex->fe_len && (buddy = mb_find_buddy(e3b, order, &max))) { ++ ++ if (block + 1 >= max) ++ break; ++ ++ next = (block + 1) * (1 << order); ++ if (mb_test_bit(next, EXT3_MB_BITMAP(e3b))) ++ break; ++ ++ ord = mb_find_order_for_block(e3b, next); ++ ++ order = ord; ++ block = next >> order; ++ ex->fe_len += 1 << order; ++ } ++ ++ J_ASSERT(ex->fe_start + ex->fe_len <= (1 << (e3b->bd_blkbits + 3))); ++ return ex->fe_len; ++} ++ ++static int mb_mark_used(struct ext3_buddy *e3b, struct ext3_free_extent *ex) ++{ ++ int ord, mlen = 0, max = 0, cur; ++ int start = ex->fe_start; ++ int len = ex->fe_len; ++ unsigned ret = 0; ++ int len0 = len; ++ void *buddy; ++ ++ mb_check_buddy(e3b); ++ ++ e3b->bd_info->bb_free -= len; ++ if (e3b->bd_info->bb_first_free == start) ++ e3b->bd_info->bb_first_free += len; ++ ++ /* let's maintain fragments counter */ ++ if (start != 0) ++ mlen = !mb_test_bit(start - 1, EXT3_MB_BITMAP(e3b)); ++ if (start + len < EXT3_SB(e3b->bd_sb)->s_mb_maxs[0]) ++ max = !mb_test_bit(start + len, EXT3_MB_BITMAP(e3b)); ++ if (mlen && max) ++ e3b->bd_info->bb_fragments++; ++ else if (!mlen && !max) ++ e3b->bd_info->bb_fragments--; ++ ++ /* let's maintain buddy itself */ ++ while (len) { ++ ord = mb_find_order_for_block(e3b, start); ++ ++ if (((start >> ord) << ord) == start && len >= (1 << ord)) { ++ /* the whole chunk may be allocated at once! 
*/ ++ mlen = 1 << ord; ++ buddy = mb_find_buddy(e3b, ord, &max); ++ J_ASSERT((start >> ord) < max); ++ mb_set_bit(start >> ord, buddy); ++ e3b->bd_info->bb_counters[ord]--; ++ start += mlen; ++ len -= mlen; ++ J_ASSERT(len >= 0); ++ continue; ++ } ++ ++ /* store for history */ ++ if (ret == 0) ++ ret = len | (ord << 16); ++ ++ /* we have to split large buddy */ ++ J_ASSERT(ord > 0); ++ buddy = mb_find_buddy(e3b, ord, &max); ++ mb_set_bit(start >> ord, buddy); ++ e3b->bd_info->bb_counters[ord]--; ++ ++ ord--; ++ cur = (start >> ord) & ~1U; ++ buddy = mb_find_buddy(e3b, ord, &max); ++ mb_clear_bit(cur, buddy); ++ mb_clear_bit(cur + 1, buddy); ++ e3b->bd_info->bb_counters[ord]++; ++ e3b->bd_info->bb_counters[ord]++; ++ } ++ ++ /* now drop all the bits in bitmap */ ++ mb_set_bits(EXT3_MB_BITMAP(e3b), ex->fe_start, len0); ++ ++ mb_check_buddy(e3b); ++ ++ return ret; ++} ++ ++/* ++ * Must be called under group lock! ++ */ ++static void ext3_mb_use_best_found(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ unsigned long ret; ++ ++ ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len); ++ ret = mb_mark_used(e3b, &ac->ac_b_ex); ++ ++ ac->ac_status = AC_STATUS_FOUND; ++ ac->ac_tail = ret & 0xffff; ++ ac->ac_buddy = ret >> 16; ++} ++ ++/* ++ * The routine checks whether found extent is good enough. If it is, ++ * then the extent gets marked used and flag is set to the context ++ * to stop scanning. Otherwise, the extent is compared with the ++ * previous found extent and if new one is better, then it's stored ++ * in the context. Later, the best found extent will be used, if ++ * mballoc can't find good enough extent. ++ * ++ * FIXME: real allocation policy is to be designed yet! 
++ */ ++static void ext3_mb_measure_extent(struct ext3_allocation_context *ac, ++ struct ext3_free_extent *ex, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_free_extent *bex = &ac->ac_b_ex; ++ struct ext3_free_extent *gex = &ac->ac_g_ex; ++ ++ J_ASSERT(ex->fe_len > 0); ++ J_ASSERT(ex->fe_len < (1 << ac->ac_sb->s_blocksize_bits) * 8); ++ J_ASSERT(ex->fe_start < (1 << ac->ac_sb->s_blocksize_bits) * 8); ++ ++ ac->ac_found++; ++ ++ /* ++ * The special case - take what you catch first ++ */ ++ if (unlikely(ac->ac_flags & EXT3_MB_HINT_FIRST)) { ++ *bex = *ex; ++ ext3_mb_use_best_found(ac, e3b); ++ return; ++ } ++ ++ /* ++ * Let's check whether the chunk is good enough ++ */ ++ if (ex->fe_len == gex->fe_len) { ++ *bex = *ex; ++ ext3_mb_use_best_found(ac, e3b); ++ return; ++ } ++ ++ /* ++ * If this is the first found extent, just store it in the context ++ */ ++ if (bex->fe_len == 0) { ++ *bex = *ex; ++ return; ++ } ++ ++ /* ++ * If the new found extent is better, store it in the context ++ */ ++ if (bex->fe_len < gex->fe_len) { ++ /* if the request isn't satisfied, any found extent ++ * larger than the previous best one is better */ ++ if (ex->fe_len > bex->fe_len) ++ *bex = *ex; ++ } else if (ex->fe_len > gex->fe_len) { ++ /* if the request is satisfied, then we try to find ++ * an extent that still satisfies the request, but is ++ * smaller than the previous one */ ++ *bex = *ex; ++ } ++ ++ /* ++ * Let's scan at least a few extents and not pick the first one ++ */ ++ if (bex->fe_len > gex->fe_len && ac->ac_found > ext3_mb_min_to_scan) ++ ac->ac_status = AC_STATUS_BREAK; ++ ++ /* ++ * We don't want to scan for a whole year ++ */ ++ if (ac->ac_found > ext3_mb_max_to_scan) ++ ac->ac_status = AC_STATUS_BREAK; ++} ++ ++static int ext3_mb_try_best_found(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct ext3_free_extent ex = ac->ac_b_ex; ++ int group = ex.fe_group, max, err; ++ ++ J_ASSERT(ex.fe_len > 0); ++ err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); ++ if
(err) ++ return err; ++ ++ ext3_lock_group(ac->ac_sb, group); ++ max = mb_find_extent(e3b, 0, ex.fe_start, ex.fe_len, &ex); ++ ++ if (max > 0) { ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ ++ ext3_unlock_group(ac->ac_sb, group); ++ ++ ext3_mb_release_desc(e3b); ++ ++ return 0; ++} ++ ++static int ext3_mb_find_by_goal(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ int group = ac->ac_g_ex.fe_group, max, err; ++ struct ext3_free_extent ex; ++ ++ err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); ++ if (err) ++ return err; ++ ++ ext3_lock_group(ac->ac_sb, group); ++ max = mb_find_extent(e3b, 0, ac->ac_g_ex.fe_start, ++ ac->ac_g_ex.fe_len, &ex); ++ ++ if (max > 0) { ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); ++ J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ ext3_unlock_group(ac->ac_sb, group); ++ ++ ext3_mb_release_desc(e3b); ++ ++ return 0; ++} ++ ++/* ++ * The routine scans buddy structures (not bitmap!) 
from given order ++ * to max order and tries to find big enough chunk to satisfy the req ++ */ ++static void ext3_mb_simple_scan_group(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ struct ext3_group_info *grp = e3b->bd_info; ++ void *buddy; ++ int i, k, max; ++ ++ J_ASSERT(ac->ac_2order > 0); ++ for (i = ac->ac_2order; i < sb->s_blocksize_bits + 1; i++) { ++ if (grp->bb_counters[i] == 0) ++ continue; ++ ++ buddy = mb_find_buddy(e3b, i, &max); ++ if (buddy == NULL) { ++ printk(KERN_ALERT "looking for wrong order?\n"); ++ break; ++ } ++ ++ k = mb_find_next_zero_bit(buddy, max, 0); ++ J_ASSERT(k < max); ++ ++ ac->ac_found++; ++ ++ ac->ac_b_ex.fe_len = 1 << i; ++ ac->ac_b_ex.fe_start = k << i; ++ ac->ac_b_ex.fe_group = e3b->bd_group; ++ ++ ext3_mb_use_best_found(ac, e3b); ++ J_ASSERT(ac->ac_b_ex.fe_len == ac->ac_g_ex.fe_len); ++ ++ if (unlikely(ext3_mb_stats)) ++ atomic_inc(&EXT3_SB(sb)->s_bal_2orders); ++ ++ break; ++ } ++} ++ ++/* ++ * The routine scans the group and measures all found extents. ++ * In order to optimize scanning, caller must pass number of ++ * free blocks in the group, so the routine can know upper limit. 
++ */ ++static void ext3_mb_complex_scan_group(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ void *bitmap = EXT3_MB_BITMAP(e3b); ++ struct ext3_free_extent ex; ++ int i, free; ++ ++ free = e3b->bd_info->bb_free; ++ J_ASSERT(free > 0); ++ ++ i = e3b->bd_info->bb_first_free; ++ ++ while (free && ac->ac_status == AC_STATUS_CONTINUE) { ++ i = mb_find_next_zero_bit(bitmap, sb->s_blocksize * 8, i); ++ if (i >= sb->s_blocksize * 8) { ++ J_ASSERT(free == 0); ++ break; ++ } ++ ++ mb_find_extent(e3b, 0, i, ac->ac_g_ex.fe_len, &ex); ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(free >= ex.fe_len); ++ ++ ext3_mb_measure_extent(ac, &ex, e3b); ++ ++ i += ex.fe_len; ++ free -= ex.fe_len; ++ } ++} ++ ++static int ext3_mb_good_group(struct ext3_allocation_context *ac, ++ int group, int cr) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); ++ struct ext3_group_info *grp = sbi->s_group_info[group]; ++ unsigned free, fragments, i, bits; ++ ++ J_ASSERT(cr >= 0 && cr < 4); ++ J_ASSERT(!EXT3_MB_GRP_NEED_INIT(grp)); ++ ++ free = grp->bb_free; ++ fragments = grp->bb_fragments; ++ if (free == 0) ++ return 0; ++ if (fragments == 0) ++ return 0; ++ ++ switch (cr) { ++ case 0: ++ J_ASSERT(ac->ac_2order != 0); ++ bits = ac->ac_sb->s_blocksize_bits + 1; ++ for (i = ac->ac_2order; i < bits; i++) ++ if (grp->bb_counters[i] > 0) ++ return 1; ++ case 1: ++ if ((free / fragments) >= ac->ac_g_ex.fe_len) ++ return 1; ++ case 2: ++ if (free >= ac->ac_g_ex.fe_len) ++ return 1; ++ case 3: ++ return 1; ++ default: ++ BUG(); ++ } ++ ++ return 0; ++} ++ ++int ext3_mb_new_blocks(handle_t *handle, struct inode *inode, ++ unsigned long goal, int *len, int flags, int *errp) ++{ ++ struct buffer_head *bitmap_bh = NULL; ++ struct ext3_allocation_context ac; ++ int i, group, block, cr, err = 0; ++ struct ext3_group_desc *gdp; ++ struct ext3_super_block *es; ++ struct buffer_head *gdp_bh; ++ struct ext3_sb_info *sbi; ++ struct super_block *sb; ++ struct 
ext3_buddy e3b; ++ ++ J_ASSERT(len != NULL); ++ J_ASSERT(*len > 0); ++ ++ sb = inode->i_sb; ++ if (!sb) { ++ printk("ext3_mb_new_blocks: nonexistent device"); ++ return 0; ++ } ++ ++ if (!test_opt(sb, MBALLOC)) { ++ static int ext3_mballoc_warning = 0; ++ if (ext3_mballoc_warning == 0) { ++ printk(KERN_ERR "EXT3-fs: multiblock request with " ++ "mballoc disabled!\n"); ++ ext3_mballoc_warning++; ++ } ++ *len = 1; ++ err = ext3_new_block_old(handle, inode, goal, errp); ++ return err; ++ } ++ ++ ext3_mb_poll_new_transaction(sb, handle); ++ ++ sbi = EXT3_SB(sb); ++ es = EXT3_SB(sb)->s_es; ++ ++ /* ++ * We can't allocate > group size ++ */ ++ if (*len >= EXT3_BLOCKS_PER_GROUP(sb) - 10) ++ *len = EXT3_BLOCKS_PER_GROUP(sb) - 10; ++ ++ if (!(flags & EXT3_MB_HINT_RESERVED)) { ++ /* someone asks for non-reserved blocks */ ++ BUG_ON(*len > 1); ++ err = ext3_mb_reserve_blocks(sb, 1); ++ if (err) { ++ *errp = err; ++ return 0; ++ } ++ } ++ ++ /* ++ * Check quota for allocation of these blocks. ++ */ ++ while (*len && DQUOT_ALLOC_BLOCK(inode, *len)) ++ *len -= 1; ++ if (*len == 0) { ++ *errp = -EDQUOT; ++ block = 0; ++ goto out; ++ } ++ ++ /* start searching from the goal */ ++ if (goal < le32_to_cpu(es->s_first_data_block) || ++ goal >= le32_to_cpu(es->s_blocks_count)) ++ goal = le32_to_cpu(es->s_first_data_block); ++ group = (goal - le32_to_cpu(es->s_first_data_block)) / ++ EXT3_BLOCKS_PER_GROUP(sb); ++ block = ((goal - le32_to_cpu(es->s_first_data_block)) % ++ EXT3_BLOCKS_PER_GROUP(sb)); ++ ++ /* set up allocation goals */ ++ ac.ac_b_ex.fe_group = 0; ++ ac.ac_b_ex.fe_start = 0; ++ ac.ac_b_ex.fe_len = 0; ++ ac.ac_status = AC_STATUS_CONTINUE; ++ ac.ac_groups_scanned = 0; ++ ac.ac_ex_scanned = 0; ++ ac.ac_found = 0; ++ ac.ac_sb = inode->i_sb; ++ ac.ac_g_ex.fe_group = group; ++ ac.ac_g_ex.fe_start = block; ++ ac.ac_g_ex.fe_len = *len; ++ ac.ac_flags = flags; ++ ac.ac_2order = 0; ++ ac.ac_criteria = 0; ++ ++ /* probably, the request is for 2^8+ blocks (1/2/3/...
MB) */ ++ i = ffs(*len); ++ if (i >= 8) { ++ i--; ++ if ((*len & (~(1 << i))) == 0) ++ ac.ac_2order = i; ++ } ++ ++ /* Sometimes, caller may want to merge even a small ++ * number of blocks into an existing extent */ ++ if (ac.ac_flags & EXT3_MB_HINT_MERGE) { ++ err = ext3_mb_find_by_goal(&ac, &e3b); ++ if (err) ++ goto out_err; ++ if (ac.ac_status == AC_STATUS_FOUND) ++ goto found; ++ } ++ ++ /* Let's just scan groups to find more or less suitable blocks */ ++ cr = ac.ac_2order ? 0 : 1; ++repeat: ++ for (; cr < 4 && ac.ac_status == AC_STATUS_CONTINUE; cr++) { ++ ac.ac_criteria = cr; ++ for (i = 0; i < EXT3_SB(sb)->s_groups_count; group++, i++) { ++ if (group == EXT3_SB(sb)->s_groups_count) ++ group = 0; ++ ++ if (EXT3_MB_GRP_NEED_INIT(sbi->s_group_info[group])) { ++ /* we need full data about the group ++ * to make a good selection */ ++ err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); ++ if (err) ++ goto out_err; ++ ext3_mb_release_desc(&e3b); ++ } ++ ++ /* check if the group is good for our criteria */ ++ if (!ext3_mb_good_group(&ac, group, cr)) ++ continue; ++ ++ err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); ++ if (err) ++ goto out_err; ++ ++ ext3_lock_group(sb, group); ++ if (!ext3_mb_good_group(&ac, group, cr)) { ++ /* someone did allocation from this group */ ++ ext3_unlock_group(sb, group); ++ ext3_mb_release_desc(&e3b); ++ continue; ++ } ++ ++ ac.ac_groups_scanned++; ++ if (cr == 0) ++ ext3_mb_simple_scan_group(&ac, &e3b); ++ else ++ ext3_mb_complex_scan_group(&ac, &e3b); ++ ++ ext3_unlock_group(sb, group); ++ ++ ext3_mb_release_desc(&e3b); ++ ++ if (err) ++ goto out_err; ++ if (ac.ac_status != AC_STATUS_CONTINUE) ++ break; ++ } ++ } ++ ++ if (ac.ac_b_ex.fe_len > 0 && ac.ac_status != AC_STATUS_FOUND && ++ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { ++ /* ++ * We've been searching too long.
Let's try to allocate ++ * the best chunk we've found so far ++ */ ++ ++ /*if (ac.ac_found > ext3_mb_max_to_scan) ++ printk(KERN_ERR "EXT3-fs: too long searching at " ++ "%u (%d/%d)\n", cr, ac.ac_b_ex.fe_len, ++ ac.ac_g_ex.fe_len);*/ ++ ext3_mb_try_best_found(&ac, &e3b); ++ if (ac.ac_status != AC_STATUS_FOUND) { ++ /* ++ * Someone luckier has already allocated it. ++ * The only thing we can do is just take the first ++ * found block(s) ++ */ ++ printk(KERN_ERR "EXT3-fs: and someone won our chunk\n"); ++ ac.ac_b_ex.fe_group = 0; ++ ac.ac_b_ex.fe_start = 0; ++ ac.ac_b_ex.fe_len = 0; ++ ac.ac_status = AC_STATUS_CONTINUE; ++ ac.ac_flags |= EXT3_MB_HINT_FIRST; ++ cr = 3; ++ goto repeat; ++ } ++ } ++ ++ if (ac.ac_status != AC_STATUS_FOUND) { ++ /* ++ * We definitely weren't lucky ++ */ ++ DQUOT_FREE_BLOCK(inode, *len); ++ *errp = -ENOSPC; ++ block = 0; ++#if 1 ++ printk(KERN_ERR "EXT3-fs: can't allocate: status %d, flags %d\n", ++ ac.ac_status, ac.ac_flags); ++ printk(KERN_ERR "EXT3-fs: goal %d, best found %d/%d/%d, cr %d\n", ++ ac.ac_g_ex.fe_len, ac.ac_b_ex.fe_group, ++ ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len, cr); ++ printk(KERN_ERR "EXT3-fs: %lu blocks reserved, %d found\n", ++ sbi->s_blocks_reserved, ac.ac_found); ++ printk("EXT3-fs: groups: "); ++ for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) ++ printk("%d: %d ", i, ++ sbi->s_group_info[i]->bb_free); ++ printk("\n"); ++#endif ++ goto out; ++ } ++ ++found: ++ J_ASSERT(ac.ac_b_ex.fe_len > 0); ++ ++ /* good news - free block(s) have been found.
now it's time ++ * to mark block(s) in good old journaled bitmap */ ++ block = ac.ac_b_ex.fe_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + ac.ac_b_ex.fe_start ++ + le32_to_cpu(es->s_first_data_block); ++ ++ /* we made a decision, now mark found blocks in good old ++ * bitmap to be journaled */ ++ ++ ext3_debug("using block group %d(%d)\n", ++ ac.ac_b_group.group, gdp->bg_free_blocks_count); ++ ++ bitmap_bh = read_block_bitmap(sb, ac.ac_b_ex.fe_group); ++ if (!bitmap_bh) { ++ *errp = -EIO; ++ goto out_err; ++ } ++ ++ err = ext3_journal_get_write_access(handle, bitmap_bh); ++ if (err) { ++ *errp = err; ++ goto out_err; ++ } ++ ++ gdp = ext3_get_group_desc(sb, ac.ac_b_ex.fe_group, &gdp_bh); ++ if (!gdp) { ++ *errp = -EIO; ++ goto out_err; ++ } ++ ++ err = ext3_journal_get_write_access(handle, gdp_bh); ++ if (err) ++ goto out_err; ++ ++ block = ac.ac_b_ex.fe_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + ac.ac_b_ex.fe_start ++ + le32_to_cpu(es->s_first_data_block); ++ ++ if (block == le32_to_cpu(gdp->bg_block_bitmap) || ++ block == le32_to_cpu(gdp->bg_inode_bitmap) || ++ in_range(block, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group)) ++ ext3_error(sb, "ext3_new_block", ++ "Allocating block in system zone - " ++ "block = %u", block); ++#ifdef AGGRESSIVE_CHECK ++ for (i = 0; i < ac.ac_b_ex.fe_len; i++) ++ J_ASSERT(!mb_test_bit(ac.ac_b_ex.fe_start + i, bitmap_bh->b_data)); ++#endif ++ mb_set_bits(bitmap_bh->b_data, ac.ac_b_ex.fe_start, ac.ac_b_ex.fe_len); ++ ++ spin_lock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); ++ gdp->bg_free_blocks_count = ++ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) ++ - ac.ac_b_ex.fe_len); ++ spin_unlock(sb_bgl_lock(sbi, ac.ac_b_ex.fe_group)); ++ percpu_counter_mod(&sbi->s_freeblocks_counter, - ac.ac_b_ex.fe_len); ++ ++ err = ext3_journal_dirty_metadata(handle, bitmap_bh); ++ if (err) ++ goto out_err; ++ err = ext3_journal_dirty_metadata(handle, gdp_bh); ++ if (err) ++ goto out_err; ++ ++ sb->s_dirt = 1; ++ *errp = 0; ++
brelse(bitmap_bh); ++ ++ /* drop non-allocated, but dquote'd blocks */ ++ J_ASSERT(*len >= ac.ac_b_ex.fe_len); ++ DQUOT_FREE_BLOCK(inode, *len - ac.ac_b_ex.fe_len); ++ ++ *len = ac.ac_b_ex.fe_len; ++ J_ASSERT(*len > 0); ++ J_ASSERT(block != 0); ++ goto out; ++ ++out_err: ++ /* if we've already allocated something, roll it back */ ++ if (ac.ac_status == AC_STATUS_FOUND) { ++ /* FIXME: free blocks here */ ++ } ++ ++ DQUOT_FREE_BLOCK(inode, *len); ++ brelse(bitmap_bh); ++ *errp = err; ++ block = 0; ++out: ++ if (!(flags & EXT3_MB_HINT_RESERVED)) { ++ /* block wasn't reserved before and we reserved it ++ * at the beginning of allocation. it doesn't matter ++ * whether we allocated anything or we failed: time ++ * to release reservation. NOTE: because I expect ++ * any multiblock request from delayed allocation ++ * path only, here is single block always */ ++ ext3_mb_release_blocks(sb, 1); ++ } ++ ++ if (unlikely(ext3_mb_stats) && ac.ac_g_ex.fe_len > 1) { ++ atomic_inc(&sbi->s_bal_reqs); ++ atomic_add(*len, &sbi->s_bal_allocated); ++ if (*len >= ac.ac_g_ex.fe_len) ++ atomic_inc(&sbi->s_bal_success); ++ atomic_add(ac.ac_found, &sbi->s_bal_ex_scanned); ++ if (ac.ac_g_ex.fe_start == ac.ac_b_ex.fe_start && ++ ac.ac_g_ex.fe_group == ac.ac_b_ex.fe_group) ++ atomic_inc(&sbi->s_bal_goals); ++ if (ac.ac_found > ext3_mb_max_to_scan) ++ atomic_inc(&sbi->s_bal_breaks); ++ } ++ ++ ext3_mb_store_history(sb, &ac); ++ ++ return block; ++} ++EXPORT_SYMBOL(ext3_mb_new_blocks); ++ ++#ifdef EXT3_MB_HISTORY ++struct ext3_mb_proc_session { ++ struct ext3_mb_history *history; ++ struct super_block *sb; ++ int start; ++ int max; ++}; ++ ++static void *ext3_mb_history_skip_empty(struct ext3_mb_proc_session *s, ++ struct ext3_mb_history *hs, ++ int first) ++{ ++ if (hs == s->history + s->max) ++ hs = s->history; ++ if (!first && hs == s->history + s->start) ++ return NULL; ++ while (hs->goal.fe_len == 0) { ++ hs++; ++ if (hs == s->history + s->max) ++ hs = s->history; ++ if (hs == s->history + 
s->start) ++ return NULL; ++ } ++ return hs; ++} ++ ++static void *ext3_mb_seq_history_start(struct seq_file *seq, loff_t *pos) ++{ ++ struct ext3_mb_proc_session *s = seq->private; ++ struct ext3_mb_history *hs; ++ int l = *pos; ++ ++ if (l == 0) ++ return SEQ_START_TOKEN; ++ hs = ext3_mb_history_skip_empty(s, s->history + s->start, 1); ++ if (!hs) ++ return NULL; ++ while (--l && (hs = ext3_mb_history_skip_empty(s, ++hs, 0)) != NULL); ++ return hs; ++} ++ ++static void *ext3_mb_seq_history_next(struct seq_file *seq, void *v, loff_t *pos) ++{ ++ struct ext3_mb_proc_session *s = seq->private; ++ struct ext3_mb_history *hs = v; ++ ++ ++*pos; ++ if (v == SEQ_START_TOKEN) ++ return ext3_mb_history_skip_empty(s, s->history + s->start, 1); ++ else ++ return ext3_mb_history_skip_empty(s, ++hs, 0); ++} ++ ++static int ext3_mb_seq_history_show(struct seq_file *seq, void *v) ++{ ++ struct ext3_mb_history *hs = v; ++ char buf[20], buf2[20]; ++ ++ if (v == SEQ_START_TOKEN) { ++ seq_printf(seq, "%-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", ++ "goal", "result", "found", "grps", "cr", "merge", ++ "tail", "broken"); ++ return 0; ++ } ++ ++ sprintf(buf, "%u/%u/%u", hs->goal.fe_group, ++ hs->goal.fe_start, hs->goal.fe_len); ++ sprintf(buf2, "%u/%u/%u", hs->result.fe_group, ++ hs->result.fe_start, hs->result.fe_len); ++ seq_printf(seq, "%-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", buf, ++ buf2, hs->found, hs->groups, hs->cr, ++ hs->merged ? "M" : "", hs->tail, ++ hs->buddy ? 
1 << hs->buddy : 0); ++ return 0; ++} ++ ++static void ext3_mb_seq_history_stop(struct seq_file *seq, void *v) ++{ ++} ++ ++static struct seq_operations ext3_mb_seq_history_ops = { ++ .start = ext3_mb_seq_history_start, ++ .next = ext3_mb_seq_history_next, ++ .stop = ext3_mb_seq_history_stop, ++ .show = ext3_mb_seq_history_show, ++}; ++ ++static int ext3_mb_seq_history_open(struct inode *inode, struct file *file) ++{ ++ struct super_block *sb = PDE(inode)->data; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_mb_proc_session *s; ++ int rc, size; ++ ++ s = kmalloc(sizeof(*s), GFP_KERNEL); ++ if (s == NULL) ++ return -ENOMEM; ++ size = sizeof(struct ext3_mb_history) * sbi->s_mb_history_max; ++ s->history = kmalloc(size, GFP_KERNEL); ++ if (s->history == NULL) { ++ kfree(s); ++ return -ENOMEM; ++ } ++ ++ spin_lock(&sbi->s_mb_history_lock); ++ memcpy(s->history, sbi->s_mb_history, size); ++ s->max = sbi->s_mb_history_max; ++ s->start = sbi->s_mb_history_cur % s->max; ++ spin_unlock(&sbi->s_mb_history_lock); ++ ++ rc = seq_open(file, &ext3_mb_seq_history_ops); ++ if (rc == 0) { ++ struct seq_file *m = (struct seq_file *)file->private_data; ++ m->private = s; ++ } else { ++ kfree(s->history); ++ kfree(s); ++ } ++ return rc; ++ ++} ++ ++static int ext3_mb_seq_history_release(struct inode *inode, struct file *file) ++{ ++ struct seq_file *seq = (struct seq_file *)file->private_data; ++ struct ext3_mb_proc_session *s = seq->private; ++ kfree(s->history); ++ kfree(s); ++ return seq_release(inode, file); ++} ++ ++static struct file_operations ext3_mb_seq_history_fops = { ++ .owner = THIS_MODULE, ++ .open = ext3_mb_seq_history_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = ext3_mb_seq_history_release, ++}; ++ ++static void ext3_mb_history_release(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ char name[64]; ++ ++ snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ remove_proc_entry("mb_history", sbi->s_mb_proc); ++ 
remove_proc_entry(name, proc_root_ext3); ++ ++ if (sbi->s_mb_history) ++ kfree(sbi->s_mb_history); ++} ++ ++static void ext3_mb_history_init(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ char name[64]; ++ int i; ++ ++ snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ sbi->s_mb_proc = proc_mkdir(name, proc_root_ext3); ++ if (sbi->s_mb_proc != NULL) { ++ struct proc_dir_entry *p; ++ p = create_proc_entry("mb_history", S_IRUGO, sbi->s_mb_proc); ++ if (p) { ++ p->proc_fops = &ext3_mb_seq_history_fops; ++ p->data = sb; ++ } ++ } ++ ++ sbi->s_mb_history_max = 1000; ++ sbi->s_mb_history_cur = 0; ++ spin_lock_init(&sbi->s_mb_history_lock); ++ i = sbi->s_mb_history_max * sizeof(struct ext3_mb_history); ++ sbi->s_mb_history = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_history != NULL) memset(sbi->s_mb_history, 0, i); ++ /* if we can't allocate the history, then we simply won't use it */ ++} ++ ++static void ++ext3_mb_store_history(struct super_block *sb, struct ext3_allocation_context *ac) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_mb_history h; ++ ++ if (unlikely(sbi->s_mb_history == NULL)) ++ return; ++ ++ h.goal = ac->ac_g_ex; ++ h.result = ac->ac_b_ex; ++ h.found = ac->ac_found; ++ h.cr = ac->ac_criteria; ++ h.groups = ac->ac_groups_scanned; ++ h.tail = ac->ac_tail; ++ h.buddy = ac->ac_buddy; ++ h.merged = 0; ++ if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start && ++ ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group) ++ h.merged = 1; ++ ++ spin_lock(&sbi->s_mb_history_lock); ++ memcpy(sbi->s_mb_history + sbi->s_mb_history_cur, &h, sizeof(h)); ++ if (++sbi->s_mb_history_cur >= sbi->s_mb_history_max) ++ sbi->s_mb_history_cur = 0; ++ spin_unlock(&sbi->s_mb_history_lock); ++} ++ ++#else ++#define ext3_mb_history_release(sb) ++#define ext3_mb_history_init(sb) ++#endif ++ ++int ext3_mb_init_backend(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int i, len; ++ ++ len = sizeof(struct ext3_buddy_group_blocks *) * 
sbi->s_groups_count; ++ sbi->s_group_info = kmalloc(len, GFP_KERNEL); ++ if (sbi->s_group_info == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ return -ENOMEM; ++ } ++ memset(sbi->s_group_info, 0, len); ++ ++ sbi->s_buddy_cache = new_inode(sb); ++ if (sbi->s_buddy_cache == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't get new inode\n"); ++ kfree(sbi->s_group_info); ++ return -ENOMEM; ++ } ++ ++ /* ++ * calculate needed size. if the bb_counters size changes, ++ * don't forget about ext3_mb_generate_buddy() ++ */ ++ len = sizeof(struct ext3_group_info); ++ len += sizeof(unsigned short) * (sb->s_blocksize_bits + 2); ++ for (i = 0; i < sbi->s_groups_count; i++) { ++ struct ext3_group_desc * desc; ++ ++ sbi->s_group_info[i] = kmalloc(len, GFP_KERNEL); ++ if (sbi->s_group_info[i] == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ goto err_out; ++ } ++ desc = ext3_get_group_desc(sb, i, NULL); ++ if (desc == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't read descriptor %u\n", i); ++ goto err_out; ++ } ++ memset(sbi->s_group_info[i], 0, len); ++ set_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, ++ &sbi->s_group_info[i]->bb_state); ++ sbi->s_group_info[i]->bb_free = ++ le16_to_cpu(desc->bg_free_blocks_count); ++ } ++ ++ return 0; ++ ++err_out: ++ while (--i >= 0) ++ kfree(sbi->s_group_info[i]); ++ iput(sbi->s_buddy_cache); ++ ++ return -ENOMEM; ++} ++ ++int ext3_mb_init(struct super_block *sb, int needs_recovery) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct inode *root = sb->s_root->d_inode; ++ unsigned i, offset, max; ++ struct dentry *dentry; ++ ++ if (!test_opt(sb, MBALLOC)) ++ return 0; ++ ++ i = (sb->s_blocksize_bits + 2) * sizeof(unsigned short); ++ ++ sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_offsets == NULL) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ return -ENOMEM; ++ } ++ sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL); ++ if (sbi->s_mb_maxs == NULL) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ 
kfree(sbi->s_mb_offsets); ++ return -ENOMEM; ++ } ++ ++ /* order 0 is regular bitmap */ ++ sbi->s_mb_maxs[0] = sb->s_blocksize << 3; ++ sbi->s_mb_offsets[0] = 0; ++ ++ i = 1; ++ offset = 0; ++ max = sb->s_blocksize << 2; ++ do { ++ sbi->s_mb_offsets[i] = offset; ++ sbi->s_mb_maxs[i] = max; ++ offset += 1 << (sb->s_blocksize_bits - i); ++ max = max >> 1; ++ i++; ++ } while (i <= sb->s_blocksize_bits + 1); ++ ++ ++ /* init file for buddy data */ ++ if ((i = ext3_mb_init_backend(sb))) { ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ kfree(sbi->s_mb_offsets); ++ kfree(sbi->s_mb_maxs); ++ return i; ++ } ++ ++ spin_lock_init(&sbi->s_reserve_lock); ++ spin_lock_init(&sbi->s_md_lock); ++ INIT_LIST_HEAD(&sbi->s_active_transaction); ++ INIT_LIST_HEAD(&sbi->s_closed_transaction); ++ INIT_LIST_HEAD(&sbi->s_committed_transaction); ++ spin_lock_init(&sbi->s_bal_lock); ++ ++ /* remove old on-disk buddy file */ ++ mutex_lock(&root->i_mutex); ++ dentry = lookup_one_len(".buddy", sb->s_root, strlen(".buddy")); ++ if (dentry->d_inode != NULL) { ++ i = vfs_unlink(root, dentry); ++ if (i != 0) ++ printk("EXT3-fs: can't remove .buddy file: %d\n", i); ++ } ++ dput(dentry); ++ mutex_unlock(&root->i_mutex); ++ ++ ext3_mb_history_init(sb); ++ ++ printk("EXT3-fs: mballoc enabled\n"); ++ return 0; ++} ++ ++int ext3_mb_release(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int i; ++ ++ if (!test_opt(sb, MBALLOC)) ++ return 0; ++ ++ /* release freed, non-committed blocks */ ++ spin_lock(&sbi->s_md_lock); ++ list_splice_init(&sbi->s_closed_transaction, ++ &sbi->s_committed_transaction); ++ list_splice_init(&sbi->s_active_transaction, ++ &sbi->s_committed_transaction); ++ spin_unlock(&sbi->s_md_lock); ++ ext3_mb_free_committed_blocks(sb); ++ ++ if (sbi->s_group_info) { ++ for (i = 0; i < sbi->s_groups_count; i++) { ++ if (sbi->s_group_info[i] == NULL) ++ continue; ++ kfree(sbi->s_group_info[i]); ++ } ++ kfree(sbi->s_group_info); ++ } ++ if (sbi->s_mb_offsets) ++ 
kfree(sbi->s_mb_offsets); ++ if (sbi->s_mb_maxs) ++ kfree(sbi->s_mb_maxs); ++ if (sbi->s_buddy_cache) ++ iput(sbi->s_buddy_cache); ++ if (sbi->s_blocks_reserved) ++ printk("ext3-fs: %ld blocks being reserved at umount!\n", ++ sbi->s_blocks_reserved); ++ if (ext3_mb_stats) { ++ printk("EXT3-fs: mballoc: %u blocks %u reqs (%u success)\n", ++ atomic_read(&sbi->s_bal_allocated), ++ atomic_read(&sbi->s_bal_reqs), ++ atomic_read(&sbi->s_bal_success)); ++ printk("EXT3-fs: mballoc: %u extents scanned, %u goal hits, " ++ "%u 2^N hits, %u breaks\n", ++ atomic_read(&sbi->s_bal_ex_scanned), ++ atomic_read(&sbi->s_bal_goals), ++ atomic_read(&sbi->s_bal_2orders), ++ atomic_read(&sbi->s_bal_breaks)); ++ printk("EXT3-fs: mballoc: %lu generated and it took %Lu\n", ++ sbi->s_mb_buddies_generated++, ++ sbi->s_mb_generation_time); ++ } ++ ++ ext3_mb_history_release(sb); ++ ++ return 0; ++} ++ ++void ext3_mb_free_committed_blocks(struct super_block *sb) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int err, i, count = 0, count2 = 0; ++ struct ext3_free_metadata *md; ++ struct ext3_buddy e3b; ++ ++ if (list_empty(&sbi->s_committed_transaction)) ++ return; ++ ++ /* there are committed blocks still to be freed */ ++ do { ++ /* get next array of blocks */ ++ md = NULL; ++ spin_lock(&sbi->s_md_lock); ++ if (!list_empty(&sbi->s_committed_transaction)) { ++ md = list_entry(sbi->s_committed_transaction.next, ++ struct ext3_free_metadata, list); ++ list_del(&md->list); ++ } ++ spin_unlock(&sbi->s_md_lock); ++ ++ if (md == NULL) ++ break; ++ ++ mb_debug("gonna free %u blocks in group %u (0x%p):", ++ md->num, md->group, md); ++ ++ err = ext3_mb_load_buddy(sb, md->group, &e3b); ++ BUG_ON(err != 0); ++ ++ /* there are blocks to put in buddy to make them really free */ ++ count += md->num; ++ count2++; ++ ext3_lock_group(sb, md->group); ++ for (i = 0; i < md->num; i++) { ++ mb_debug(" %u", md->blocks[i]); ++ mb_free_blocks(&e3b, md->blocks[i], 1); ++ } ++ mb_debug("\n"); ++ ext3_unlock_group(sb, 
md->group); ++ ++ /* balance refcounts from ext3_mb_free_metadata() */ ++ page_cache_release(e3b.bd_buddy_page); ++ page_cache_release(e3b.bd_bitmap_page); ++ ++ kfree(md); ++ ext3_mb_release_desc(&e3b); ++ ++ } while (md); ++ mb_debug("freed %u blocks in %u structures\n", count, count2); ++} ++ ++void ext3_mb_poll_new_transaction(struct super_block *sb, handle_t *handle) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ ++ if (sbi->s_last_transaction == handle->h_transaction->t_tid) ++ return; ++ ++ /* new transaction! time to close the last one and free blocks ++ * for the committed transaction. we know that only one transaction ++ * can be active at a time, so the previous transaction may still ++ * be being logged, and the transaction before the previous one is ++ * already known to be logged. this means we may now free blocks ++ * freed in all transactions before the previous one. */ ++ ++ spin_lock(&sbi->s_md_lock); ++ if (sbi->s_last_transaction != handle->h_transaction->t_tid) { ++ mb_debug("new transaction %lu, old %lu\n", ++ (unsigned long) handle->h_transaction->t_tid, ++ (unsigned long) sbi->s_last_transaction); ++ list_splice_init(&sbi->s_closed_transaction, ++ &sbi->s_committed_transaction); ++ list_splice_init(&sbi->s_active_transaction, ++ &sbi->s_closed_transaction); ++ sbi->s_last_transaction = handle->h_transaction->t_tid; ++ } ++ spin_unlock(&sbi->s_md_lock); ++ ++ ext3_mb_free_committed_blocks(sb); ++} ++ ++int ext3_mb_free_metadata(handle_t *handle, struct ext3_buddy *e3b, ++ int group, int block, int count) ++{ ++ struct ext3_group_info *db = e3b->bd_info; ++ struct super_block *sb = e3b->bd_sb; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ struct ext3_free_metadata *md; ++ int i; ++ ++ J_ASSERT(e3b->bd_bitmap_page != NULL); ++ J_ASSERT(e3b->bd_buddy_page != NULL); ++ ++ ext3_lock_group(sb, group); ++ for (i = 0; i < count; i++) { ++ md = db->bb_md_cur; ++ if (md && db->bb_tid != handle->h_transaction->t_tid) { ++ db->bb_md_cur = NULL; ++ md = 
NULL; ++ } ++ ++ if (md == NULL) { ++ ext3_unlock_group(sb, group); ++ md = kmalloc(sizeof(*md), GFP_KERNEL); ++ if (md == NULL) ++ return -ENOMEM; ++ md->num = 0; ++ md->group = group; ++ ++ ext3_lock_group(sb, group); ++ if (db->bb_md_cur == NULL) { ++ spin_lock(&sbi->s_md_lock); ++ list_add(&md->list, &sbi->s_active_transaction); ++ spin_unlock(&sbi->s_md_lock); ++ /* protect buddy cache from being freed, ++ * otherwise we'll refresh it from ++ * on-disk bitmap and lose not-yet-available ++ * blocks */ ++ page_cache_get(e3b->bd_buddy_page); ++ page_cache_get(e3b->bd_bitmap_page); ++ db->bb_md_cur = md; ++ db->bb_tid = handle->h_transaction->t_tid; ++ mb_debug("new md 0x%p for group %u\n", ++ md, md->group); ++ } else { ++ kfree(md); ++ md = db->bb_md_cur; ++ } ++ } ++ ++ BUG_ON(md->num >= EXT3_BB_MAX_BLOCKS); ++ md->blocks[md->num] = block + i; ++ md->num++; ++ if (md->num == EXT3_BB_MAX_BLOCKS) { ++ /* no more space, put full container on a sb's list */ ++ db->bb_md_cur = NULL; ++ } ++ } ++ ext3_unlock_group(sb, group); ++ return 0; ++} ++ ++void ext3_mb_free_blocks(handle_t *handle, struct inode *inode, ++ unsigned long block, unsigned long count, ++ int metadata, int *freed) ++{ ++ struct buffer_head *bitmap_bh = NULL; ++ struct ext3_group_desc *gdp; ++ struct ext3_super_block *es; ++ unsigned long bit, overflow; ++ struct buffer_head *gd_bh; ++ unsigned long block_group; ++ struct ext3_sb_info *sbi; ++ struct super_block *sb; ++ struct ext3_buddy e3b; ++ int err = 0, ret; ++ ++ *freed = 0; ++ sb = inode->i_sb; ++ if (!sb) { ++ printk ("ext3_free_blocks: nonexistent device"); ++ return; ++ } ++ ++ ext3_mb_poll_new_transaction(sb, handle); ++ ++ sbi = EXT3_SB(sb); ++ es = EXT3_SB(sb)->s_es; ++ if (block < le32_to_cpu(es->s_first_data_block) || ++ block + count < block || ++ block + count > le32_to_cpu(es->s_blocks_count)) { ++ ext3_error (sb, "ext3_free_blocks", ++ "Freeing blocks not in datazone - " ++ "block = %lu, count = %lu", block, count); ++ goto 
error_return; ++ } ++ ++ ext3_debug("freeing block %lu\n", block); ++ ++do_more: ++ overflow = 0; ++ block_group = (block - le32_to_cpu(es->s_first_data_block)) / ++ EXT3_BLOCKS_PER_GROUP(sb); ++ bit = (block - le32_to_cpu(es->s_first_data_block)) % ++ EXT3_BLOCKS_PER_GROUP(sb); ++ /* ++ * Check to see if we are freeing blocks across a group ++ * boundary. ++ */ ++ if (bit + count > EXT3_BLOCKS_PER_GROUP(sb)) { ++ overflow = bit + count - EXT3_BLOCKS_PER_GROUP(sb); ++ count -= overflow; ++ } ++ brelse(bitmap_bh); ++ bitmap_bh = read_block_bitmap(sb, block_group); ++ if (!bitmap_bh) ++ goto error_return; ++ gdp = ext3_get_group_desc (sb, block_group, &gd_bh); ++ if (!gdp) ++ goto error_return; ++ ++ if (in_range (le32_to_cpu(gdp->bg_block_bitmap), block, count) || ++ in_range (le32_to_cpu(gdp->bg_inode_bitmap), block, count) || ++ in_range (block, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group) || ++ in_range (block + count - 1, le32_to_cpu(gdp->bg_inode_table), ++ EXT3_SB(sb)->s_itb_per_group)) ++ ext3_error (sb, "ext3_free_blocks", ++ "Freeing blocks in system zones - " ++ "Block = %lu, count = %lu", ++ block, count); ++ ++ BUFFER_TRACE(bitmap_bh, "getting write access"); ++ err = ext3_journal_get_write_access(handle, bitmap_bh); ++ if (err) ++ goto error_return; ++ ++ /* ++ * We are about to modify some metadata. 
Call the journal APIs ++ * to unshare ->b_data if a currently-committing transaction is ++ * using it ++ */ ++ BUFFER_TRACE(gd_bh, "get_write_access"); ++ err = ext3_journal_get_write_access(handle, gd_bh); ++ if (err) ++ goto error_return; ++ ++ err = ext3_mb_load_buddy(sb, block_group, &e3b); ++ if (err) ++ goto error_return; ++ ++#ifdef AGGRESSIVE_CHECK ++ { ++ int i; ++ for (i = 0; i < count; i++) ++ J_ASSERT(mb_test_bit(bit + i, bitmap_bh->b_data)); ++ } ++#endif ++ mb_clear_bits(bitmap_bh->b_data, bit, count); ++ ++ /* We dirtied the bitmap block */ ++ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); ++ err = ext3_journal_dirty_metadata(handle, bitmap_bh); ++ ++ if (metadata) { ++ /* blocks being freed are metadata. these blocks shouldn't ++ * be used until this transaction is committed */ ++ ext3_mb_free_metadata(handle, &e3b, block_group, bit, count); ++ } else { ++ ext3_lock_group(sb, block_group); ++ mb_free_blocks(&e3b, bit, count); ++ ext3_unlock_group(sb, block_group); ++ } ++ ++ spin_lock(sb_bgl_lock(sbi, block_group)); ++ gdp->bg_free_blocks_count = ++ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count); ++ spin_unlock(sb_bgl_lock(sbi, block_group)); ++ percpu_counter_mod(&sbi->s_freeblocks_counter, count); ++ ++ ext3_mb_release_desc(&e3b); ++ ++ *freed = count; ++ ++ /* And the group descriptor block */ ++ BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); ++ ret = ext3_journal_dirty_metadata(handle, gd_bh); ++ if (!err) err = ret; ++ ++ if (overflow && !err) { ++ block += count; ++ count = overflow; ++ goto do_more; ++ } ++ sb->s_dirt = 1; ++error_return: ++ brelse(bitmap_bh); ++ ext3_std_error(sb, err); ++ return; ++} ++ ++int ext3_mb_reserve_blocks(struct super_block *sb, int blocks) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ int free, ret = -ENOSPC; ++ ++ BUG_ON(blocks < 0); ++ spin_lock(&sbi->s_reserve_lock); ++ free = percpu_counter_read_positive(&sbi->s_freeblocks_counter); ++ if (blocks <= free - sbi->s_blocks_reserved) { 
++ sbi->s_blocks_reserved += blocks; ++ ret = 0; ++ } ++ spin_unlock(&sbi->s_reserve_lock); ++ return ret; ++} ++ ++void ext3_mb_release_blocks(struct super_block *sb, int blocks) ++{ ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ ++ BUG_ON(blocks < 0); ++ spin_lock(&sbi->s_reserve_lock); ++ sbi->s_blocks_reserved -= blocks; ++ WARN_ON(sbi->s_blocks_reserved < 0); ++ if (sbi->s_blocks_reserved < 0) ++ sbi->s_blocks_reserved = 0; ++ spin_unlock(&sbi->s_reserve_lock); ++} ++ ++int ext3_new_block(handle_t *handle, struct inode *inode, ++ unsigned long goal, int *errp) ++{ ++ int ret, len; ++ ++ if (!test_opt(inode->i_sb, MBALLOC)) { ++ ret = ext3_new_block_old(handle, inode, goal, errp); ++ goto out; ++ } ++ len = 1; ++ ret = ext3_mb_new_blocks(handle, inode, goal, &len, 0, errp); ++out: ++ return ret; ++} ++ ++ ++void ext3_free_blocks(handle_t *handle, struct inode * inode, ++ unsigned long block, unsigned long count, int metadata) ++{ ++ struct super_block *sb; ++ int freed; ++ ++ sb = inode->i_sb; ++ if (!test_opt(sb, MBALLOC)) ++ ext3_free_blocks_sb(handle, sb, block, count, &freed); ++ else ++ ext3_mb_free_blocks(handle, inode, block, count, metadata, &freed); ++ if (freed) ++ DQUOT_FREE_BLOCK(inode, freed); ++ return; ++} ++ ++#define EXT3_ROOT "ext3" ++#define EXT3_MB_STATS_NAME "mb_stats" ++#define EXT3_MB_MAX_TO_SCAN_NAME "mb_max_to_scan" ++#define EXT3_MB_MIN_TO_SCAN_NAME "mb_min_to_scan" ++ ++static int ext3_mb_stats_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_stats); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_stats_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_STATS_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if
(copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Only set to 0 or 1 respectively; zero->0; non-zero->1 */ ++ ext3_mb_stats = (simple_strtol(str, NULL, 0) != 0); ++ return count; ++} ++ ++static int ext3_mb_max_to_scan_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_max_to_scan); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_max_to_scan_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_MAX_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* the new limit must be a positive integer */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_max_to_scan = value; ++ ++ return count; ++} ++ ++static int ext3_mb_min_to_scan_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_min_to_scan); ++ *start = page; ++ return len; ++} ++ ++static int ext3_mb_min_to_scan_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ EXT3_MB_MIN_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* the new limit must be a positive integer */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_min_to_scan = value; ++ ++ return count; ++} ++ ++int __init init_ext3_proc(void) ++{ ++ struct proc_dir_entry
*proc_ext3_mb_stats; ++ struct proc_dir_entry *proc_ext3_mb_max_to_scan; ++ struct proc_dir_entry *proc_ext3_mb_min_to_scan; ++ ++ proc_root_ext3 = proc_mkdir(EXT3_ROOT, proc_root_fs); ++ if (proc_root_ext3 == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", EXT3_ROOT); ++ return -EIO; ++ } ++ ++ /* Initialize EXT3_MB_STATS_NAME */ ++ proc_ext3_mb_stats = create_proc_entry(EXT3_MB_STATS_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_stats == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_STATS_NAME); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_stats->data = NULL; ++ proc_ext3_mb_stats->read_proc = ext3_mb_stats_read; ++ proc_ext3_mb_stats->write_proc = ext3_mb_stats_write; ++ ++ /* Initialize EXT3_MAX_TO_SCAN_NAME */ ++ proc_ext3_mb_max_to_scan = create_proc_entry( ++ EXT3_MB_MAX_TO_SCAN_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_max_to_scan == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_MAX_TO_SCAN_NAME); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_max_to_scan->data = NULL; ++ proc_ext3_mb_max_to_scan->read_proc = ext3_mb_max_to_scan_read; ++ proc_ext3_mb_max_to_scan->write_proc = ext3_mb_max_to_scan_write; ++ ++ /* Initialize EXT3_MIN_TO_SCAN_NAME */ ++ proc_ext3_mb_min_to_scan = create_proc_entry( ++ EXT3_MB_MIN_TO_SCAN_NAME, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_min_to_scan == NULL) { ++ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ EXT3_MB_MIN_TO_SCAN_NAME); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_min_to_scan->data = NULL; ++ proc_ext3_mb_min_to_scan->read_proc = ext3_mb_min_to_scan_read; ++ 
proc_ext3_mb_min_to_scan->write_proc = ext3_mb_min_to_scan_write; ++ ++ return 0; ++} ++ ++void exit_ext3_proc(void) ++{ ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++} ++ +Index: linux-stage/fs/ext3/extents.c +=================================================================== +--- linux-stage.orig/fs/ext3/extents.c 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/extents.c 2006-07-16 02:29:49.000000000 +0800 +@@ -771,7 +771,7 @@ cleanup: + for (i = 0; i < depth; i++) { + if (!ablocks[i]) + continue; +- ext3_free_blocks(handle, tree->inode, ablocks[i], 1); ++ ext3_free_blocks(handle, tree->inode, ablocks[i], 1, 1); + } + } + kfree(ablocks); +@@ -1428,7 +1428,7 @@ int ext3_ext_rm_idx(handle_t *handle, st + path->p_idx->ei_leaf); + bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); + ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); +- ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1); ++ ext3_free_blocks(handle, tree->inode, path->p_idx->ei_leaf, 1, 1); + return err; + } + +@@ -1913,10 +1913,12 @@ ext3_remove_blocks(struct ext3_extents_t + int needed = ext3_remove_blocks_credits(tree, ex, from, to); + handle_t *handle = ext3_journal_start(tree->inode, needed); + struct buffer_head *bh; +- int i; ++ int i, metadata = 0; + + if (IS_ERR(handle)) + return PTR_ERR(handle); ++ if (S_ISDIR(tree->inode->i_mode) || S_ISLNK(tree->inode->i_mode)) ++ metadata = 1; + if (from >= ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { + /* tail removal */ + unsigned long num, start; +@@ -1928,7 +1930,7 @@ ext3_remove_blocks(struct ext3_extents_t + bh = sb_find_get_block(tree->inode->i_sb, start + i); + ext3_forget(handle, 0, tree->inode, bh, start + i); + } +- ext3_free_blocks(handle, tree->inode, start, num); ++ ext3_free_blocks(handle, 
tree->inode, start, num, metadata); + } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { + printk("strange request: removal %lu-%lu from %u:%u\n", + from, to, ex->ee_block, ex->ee_len); +Index: linux-stage/fs/ext3/xattr.c +=================================================================== +--- linux-stage.orig/fs/ext3/xattr.c 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/xattr.c 2006-07-16 02:29:49.000000000 +0800 +@@ -484,7 +484,7 @@ ext3_xattr_release_block(handle_t *handl + ea_bdebug(bh, "refcount now=0; freeing"); + if (ce) + mb_cache_entry_free(ce); +- ext3_free_blocks(handle, inode, bh->b_blocknr, 1); ++ ext3_free_blocks(handle, inode, bh->b_blocknr, 1, 1); + get_bh(bh); + ext3_forget(handle, 1, inode, bh, bh->b_blocknr); + } else { +@@ -805,7 +805,7 @@ inserted: + new_bh = sb_getblk(sb, block); + if (!new_bh) { + getblk_failed: +- ext3_free_blocks(handle, inode, block, 1); ++ ext3_free_blocks(handle, inode, block, 1, 1); + error = -EIO; + goto cleanup; + } +Index: linux-stage/fs/ext3/balloc.c +=================================================================== +--- linux-stage.orig/fs/ext3/balloc.c 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/balloc.c 2006-07-16 02:33:13.000000000 +0800 +@@ -79,7 +79,7 @@ struct ext3_group_desc * ext3_get_group_ + * + * Return buffer_head on success or NULL in case of failure. 
+ */ +-static struct buffer_head * ++struct buffer_head * + read_block_bitmap(struct super_block *sb, unsigned int block_group) + { + struct ext3_group_desc * desc; +@@ -490,24 +490,6 @@ error_return: + return; + } + +-/* Free given blocks, update quota and i_blocks field */ +-void ext3_free_blocks(handle_t *handle, struct inode *inode, +- ext3_fsblk_t block, unsigned long count) +-{ +- struct super_block * sb; +- unsigned long dquot_freed_blocks; +- +- sb = inode->i_sb; +- if (!sb) { +- printk ("ext3_free_blocks: nonexistent device"); +- return; +- } +- ext3_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks); +- if (dquot_freed_blocks) +- DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); +- return; +-} +- + /* + * For ext3 allocations, we must not reuse any blocks which are + * allocated in the bitmap buffer's "last committed data" copy. This +@@ -1463,7 +1445,7 @@ out: + return 0; + } + +-ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode, ++ext3_fsblk_t ext3_new_block_old(handle_t *handle, struct inode *inode, + ext3_fsblk_t goal, int *errp) + { + unsigned long count = 1; +Index: linux-stage/fs/ext3/super.c +=================================================================== +--- linux-stage.orig/fs/ext3/super.c 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/super.c 2006-07-16 02:29:49.000000000 +0800 +@@ -391,6 +391,7 @@ static void ext3_put_super (struct super + struct ext3_super_block *es = sbi->s_es; + int i; + ++ ext3_mb_release(sb); + ext3_ext_release(sb); + ext3_xattr_put_super(sb); + journal_destroy(sbi->s_journal); +@@ -641,7 +642,7 @@ enum { + Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, + Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, + Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, +- Opt_extents, Opt_extdebug, ++ Opt_extents, Opt_extdebug, Opt_mballoc, + Opt_grpquota + }; + +@@ -696,6 +697,7 @@ static match_table_t tokens = { + {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_extents, "extents"}, + 
{Opt_extdebug, "extdebug"}, ++ {Opt_mballoc, "mballoc"}, + {Opt_barrier, "barrier=%u"}, + {Opt_err, NULL}, + {Opt_resize, "resize"}, +@@ -1047,6 +1049,9 @@ clear_qf_name: + case Opt_extdebug: + set_opt (sbi->s_mount_opt, EXTDEBUG); + break; ++ case Opt_mballoc: ++ set_opt (sbi->s_mount_opt, MBALLOC); ++ break; + default: + printk (KERN_ERR + "EXT3-fs: Unrecognized mount option \"%s\" " +@@ -1773,6 +1778,7 @@ static int ext3_fill_super (struct super + "writeback"); + + ext3_ext_init(sb); ++ ext3_mb_init(sb, needs_recovery); + lock_kernel(); + return 0; + +@@ -2712,7 +2718,13 @@ static struct file_system_type ext3_fs_t + + static int __init init_ext3_fs(void) + { +- int err = init_ext3_xattr(); ++ int err; ++ ++ err = init_ext3_proc(); ++ if (err) ++ return err; ++ ++ err = init_ext3_xattr(); + if (err) + return err; + err = init_inodecache(); +@@ -2734,6 +2746,7 @@ static void __exit exit_ext3_fs(void) + unregister_filesystem(&ext3_fs_type); + destroy_inodecache(); + exit_ext3_xattr(); ++ exit_ext3_proc(); + } + + int ext3_prep_san_write(struct inode *inode, long *blocks, +Index: linux-stage/fs/ext3/Makefile +=================================================================== +--- linux-stage.orig/fs/ext3/Makefile 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/Makefile 2006-07-16 02:29:49.000000000 +0800 +@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT3_FS) += ext3.o + + ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ + ioctl.o namei.o super.o symlink.o hash.o resize.o \ +- extents.o ++ extents.o mballoc.o + + ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o + ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o +Index: linux-stage/include/linux/ext3_fs.h +=================================================================== +--- linux-stage.orig/include/linux/ext3_fs.h 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/include/linux/ext3_fs.h 2006-07-16 02:29:49.000000000 +0800 +@@ -53,6 +53,14 @@ + #define ext3_debug(f, a...) 
do {} while (0) + #endif + ++#define EXT3_MULTIBLOCK_ALLOCATOR 1 ++ ++#define EXT3_MB_HINT_MERGE 1 ++#define EXT3_MB_HINT_RESERVED 2 ++#define EXT3_MB_HINT_METADATA 4 ++#define EXT3_MB_HINT_FIRST 8 ++#define EXT3_MB_HINT_BEST 16 ++ + /* + * Special inodes numbers + */ +@@ -379,6 +387,7 @@ struct ext3_inode { + #define EXT3_MOUNT_IOPEN_NOPRIV 0x800000/* Make iopen world-readable */ + #define EXT3_MOUNT_EXTENTS 0x1000000/* Extents support */ + #define EXT3_MOUNT_EXTDEBUG 0x2000000/* Extents debug */ ++#define EXT3_MOUNT_MBALLOC 0x4000000/* Buddy allocation support */ + + /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ + #ifndef clear_opt +@@ -749,12 +758,12 @@ ext3_group_first_block_no(struct super_b + /* balloc.c */ + extern int ext3_bg_has_super(struct super_block *sb, int group); + extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group); +-extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode, +- ext3_fsblk_t goal, int *errp); ++//extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode, ++// ext3_fsblk_t goal, int *errp); + extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode, + ext3_fsblk_t goal, unsigned long *count, int *errp); + extern void ext3_free_blocks (handle_t *handle, struct inode *inode, +- ext3_fsblk_t block, unsigned long count); ++ ext3_fsblk_t block, unsigned long count, int metadata); + extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb, + ext3_fsblk_t block, unsigned long count, + unsigned long *pdquot_freed_blocks); +@@ -881,6 +890,17 @@ extern void ext3_extents_initialize_bloc + extern int ext3_ext_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg); + ++/* mballoc.c */ ++extern long ext3_mb_stats; ++extern long ext3_mb_max_to_scan; ++extern int ext3_mb_init(struct super_block *, int); ++extern int ext3_mb_release(struct super_block *); ++extern int ext3_mb_new_blocks(handle_t *,
struct inode *, unsigned long, int *, int, int *); ++extern int ext3_mb_reserve_blocks(struct super_block *, int); ++extern void ext3_mb_release_blocks(struct super_block *, int); ++int __init init_ext3_proc(void); ++void exit_ext3_proc(void); ++ + #endif /* __KERNEL__ */ + + /* EXT3_IOC_CREATE_INUM at bottom of file (visible to kernel and user). */ +Index: linux-stage/include/linux/ext3_fs_sb.h +=================================================================== +--- linux-stage.orig/include/linux/ext3_fs_sb.h 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/include/linux/ext3_fs_sb.h 2006-07-16 02:29:49.000000000 +0800 +@@ -21,8 +21,14 @@ + #include + #include + #include ++#include + #endif + #include ++#include ++ ++struct ext3_buddy_group_blocks; ++struct ext3_mb_history; ++#define EXT3_BB_MAX_BLOCKS + + /* + * third extended-fs super-block data in memory +@@ -78,6 +84,38 @@ struct ext3_sb_info { + char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ + int s_jquota_fmt; /* Format of quota to use */ + #endif ++ ++ /* for buddy allocator */ ++ struct ext3_group_info **s_group_info; ++ struct inode *s_buddy_cache; ++ long s_blocks_reserved; ++ spinlock_t s_reserve_lock; ++ struct list_head s_active_transaction; ++ struct list_head s_closed_transaction; ++ struct list_head s_committed_transaction; ++ spinlock_t s_md_lock; ++ tid_t s_last_transaction; ++ int s_mb_factor; ++ unsigned short *s_mb_offsets, *s_mb_maxs; ++ ++ /* history to debug policy */ ++ struct ext3_mb_history *s_mb_history; ++ int s_mb_history_cur; ++ int s_mb_history_max; ++ struct proc_dir_entry *s_mb_proc; ++ spinlock_t s_mb_history_lock; ++ ++ /* stats for buddy allocator */ ++ atomic_t s_bal_reqs; /* number of reqs with len > 1 */ ++ atomic_t s_bal_success; /* we found long enough chunks */ ++ atomic_t s_bal_allocated; /* in blocks */ ++ atomic_t s_bal_ex_scanned; /* total extents scanned */ ++ atomic_t s_bal_goals; /* goal hits */ ++ atomic_t s_bal_breaks; /* too 
long searches */ ++ atomic_t s_bal_2orders; /* 2^order hits */ ++ spinlock_t s_bal_lock; ++ unsigned long s_mb_buddies_generated; ++ unsigned long long s_mb_generation_time; + }; + + #endif /* _LINUX_EXT3_FS_SB */ +Index: linux-stage/fs/ext3/inode.c +=================================================================== +--- linux-stage.orig/fs/ext3/inode.c 2006-07-16 02:29:43.000000000 +0800 ++++ linux-stage/fs/ext3/inode.c 2006-07-16 02:29:49.000000000 +0800 +@@ -562,7 +562,7 @@ static int ext3_alloc_blocks(handle_t *h + return ret; + failed_out: + for (i = 0; i @@ -72,13 +72,13 @@ Index: linux-2.6.9-full/include/linux/ext3_fs_sb.h /* * third extended-fs super-block data in memory -@@ -81,6 +87,38 @@ struct ext3_sb_info { +@@ -81,6 +87,43 @@ struct ext3_sb_info { char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + + /* for buddy allocator */ -+ struct ext3_group_info **s_group_info; ++ struct ext3_group_info ***s_group_info; + struct inode *s_buddy_cache; + long s_blocks_reserved; + spinlock_t s_reserve_lock; @@ -89,6 +89,7 @@ Index: linux-2.6.9-full/include/linux/ext3_fs_sb.h + tid_t s_last_transaction; + int s_mb_factor; + unsigned short *s_mb_offsets, *s_mb_maxs; ++ unsigned long s_stripe; + + /* history to debug policy */ + struct ext3_mb_history *s_mb_history; @@ -109,13 +110,17 @@ Index: linux-2.6.9-full/include/linux/ext3_fs_sb.h + unsigned long s_mb_buddies_generated; + unsigned long long s_mb_generation_time; }; ++ ++#define EXT3_GROUP_INFO(sb, group) \ ++ EXT3_SB(sb)->s_group_info[(group) >> EXT3_DESC_PER_BLOCK_BITS(sb)] \ ++ [(group) & (EXT3_DESC_PER_BLOCK(sb) - 1)] #endif /* _LINUX_EXT3_FS_SB */ -Index: linux-2.6.9-full/fs/ext3/super.c +Index: linux-stage/fs/ext3/super.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/super.c 2005-12-16 23:16:41.000000000 +0300 -+++ linux-2.6.9-full/fs/ext3/super.c 2005-12-16 
23:16:42.000000000 +0300 -@@ -394,6 +394,7 @@ void ext3_put_super (struct super_block +--- linux-stage.orig/fs/ext3/super.c 2006-05-25 10:36:04.000000000 -0600 ++++ linux-stage/fs/ext3/super.c 2006-05-25 10:36:04.000000000 -0600 +@@ -394,6 +394,7 @@ void ext3_put_super (struct super_block struct ext3_super_block *es = sbi->s_es; int i; @@ -123,34 +128,45 @@ Index: linux-2.6.9-full/fs/ext3/super.c ext3_ext_release(sb); ext3_xattr_put_super(sb); journal_destroy(sbi->s_journal); -@@ -596,7 +597,7 @@ enum { - Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, +@@ -597,6 +598,7 @@ enum { Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, -- Opt_extents, Opt_extdebug, -+ Opt_extents, Opt_extdebug, Opt_mballoc, + Opt_extents, Opt_noextents, Opt_extdebug, ++ Opt_mballoc, Opt_nomballoc, Opt_stripe, }; static match_table_t tokens = { -@@ -647,6 +649,7 @@ static match_table_t tokens = { - {Opt_iopen_nopriv, "iopen_nopriv"}, +@@ -649,6 +651,9 @@ static match_table_t tokens = { {Opt_extents, "extents"}, + {Opt_noextents, "noextents"}, {Opt_extdebug, "extdebug"}, + {Opt_mballoc, "mballoc"}, ++ {Opt_nomballoc, "nomballoc"}, ++ {Opt_stripe, "stripe=%u"}, {Opt_barrier, "barrier=%u"}, {Opt_err, NULL}, {Opt_resize, "resize"}, -@@ -957,6 +960,9 @@ clear_qf_name: +@@ -962,6 +967,19 @@ static int parse_options (char * options case Opt_extdebug: set_opt (sbi->s_mount_opt, EXTDEBUG); break; + case Opt_mballoc: -+ set_opt (sbi->s_mount_opt, MBALLOC); ++ set_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_nomballoc: ++ clear_opt(sbi->s_mount_opt, MBALLOC); ++ break; ++ case Opt_stripe: ++ if (match_int(&args[0], &option)) ++ return 0; ++ if (option < 0) ++ return 0; ++ sbi->s_stripe = option; + break; default: printk (KERN_ERR "EXT3-fs: Unrecognized mount option \"%s\" " -@@ -1646,6 +1652,7 @@ static int ext3_fill_super (struct super +@@ -1651,6 +1669,7 @@ static int ext3_fill_super (struct super ext3_count_dirs(sb)); ext3_ext_init(sb); @@ -158,7 +174,7 @@ Index: 
linux-2.6.9-full/fs/ext3/super.c return 0; -@@ -2428,7 +2435,13 @@ static struct file_system_type ext3_fs_t +@@ -2433,7 +2452,13 @@ static struct file_system_type ext3_fs_t static int __init init_ext3_fs(void) { @@ -173,7 +189,7 @@ Index: linux-2.6.9-full/fs/ext3/super.c if (err) return err; err = init_inodecache(); -@@ -2450,6 +2463,7 @@ static void __exit exit_ext3_fs(void) +@@ -2455,6 +2480,7 @@ static void __exit exit_ext3_fs(void) unregister_filesystem(&ext3_fs_type); destroy_inodecache(); exit_ext3_xattr(); @@ -181,11 +197,11 @@ Index: linux-2.6.9-full/fs/ext3/super.c } int ext3_prep_san_write(struct inode *inode, long *blocks, -Index: linux-2.6.9-full/fs/ext3/extents.c +Index: linux-stage/fs/ext3/extents.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/extents.c 2005-12-16 23:16:41.000000000 +0300 -+++ linux-2.6.9-full/fs/ext3/extents.c 2005-12-16 23:16:42.000000000 +0300 -@@ -771,7 +771,7 @@ cleanup: +--- linux-stage.orig/fs/ext3/extents.c 2006-05-25 10:36:04.000000000 -0600 ++++ linux-stage/fs/ext3/extents.c 2006-05-25 10:36:04.000000000 -0600 +@@ -777,7 +777,7 @@ cleanup: for (i = 0; i < depth; i++) { if (!ablocks[i]) continue; @@ -194,7 +210,7 @@ Index: linux-2.6.9-full/fs/ext3/extents.c } } kfree(ablocks); -@@ -1428,7 +1428,7 @@ int ext3_ext_rm_idx(handle_t *handle, st +@@ -1434,7 +1434,7 @@ int ext3_ext_rm_idx(handle_t *handle, st path->p_idx->ei_leaf); bh = sb_find_get_block(tree->inode->i_sb, path->p_idx->ei_leaf); ext3_forget(handle, 1, tree->inode, bh, path->p_idx->ei_leaf); @@ -203,7 +219,7 @@ Index: linux-2.6.9-full/fs/ext3/extents.c return err; } -@@ -1913,10 +1913,12 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1919,10 +1919,12 @@ ext3_remove_blocks(struct ext3_extents_t int needed = ext3_remove_blocks_credits(tree, ex, from, to); handle_t *handle = ext3_journal_start(tree->inode, needed); struct buffer_head *bh; @@ -217,7 +233,7 @@ Index: linux-2.6.9-full/fs/ext3/extents.c if (from >= 
ex->ee_block && to == ex->ee_block + ex->ee_len - 1) { /* tail removal */ unsigned long num, start; -@@ -1928,7 +1930,7 @@ ext3_remove_blocks(struct ext3_extents_t +@@ -1934,7 +1936,7 @@ ext3_remove_blocks(struct ext3_extents_t bh = sb_find_get_block(tree->inode->i_sb, start + i); ext3_forget(handle, 0, tree->inode, bh, start + i); } @@ -226,10 +242,10 @@ Index: linux-2.6.9-full/fs/ext3/extents.c } else if (from == ex->ee_block && to <= ex->ee_block + ex->ee_len - 1) { printk("strange request: removal %lu-%lu from %u:%u\n", from, to, ex->ee_block, ex->ee_len); -Index: linux-2.6.9-full/fs/ext3/inode.c +Index: linux-stage/fs/ext3/inode.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/inode.c 2005-12-16 23:16:41.000000000 +0300 -+++ linux-2.6.9-full/fs/ext3/inode.c 2005-12-16 23:16:42.000000000 +0300 +--- linux-stage.orig/fs/ext3/inode.c 2006-05-25 10:36:04.000000000 -0600 ++++ linux-stage/fs/ext3/inode.c 2006-05-25 10:36:04.000000000 -0600 @@ -572,7 +572,7 @@ static int ext3_alloc_branch(handle_t *h ext3_journal_forget(handle, branch[i].bh); } @@ -257,7 +273,7 @@ Index: linux-2.6.9-full/fs/ext3/inode.c } /** -@@ -2004,7 +2004,7 @@ static void ext3_free_branches(handle_t +@@ -2004,7 +2004,7 @@ static void ext3_free_branches(handle_t ext3_journal_test_restart(handle, inode); } @@ -266,10 +282,10 @@ Index: linux-2.6.9-full/fs/ext3/inode.c if (parent_bh) { /* -Index: linux-2.6.9-full/fs/ext3/balloc.c +Index: linux-stage/fs/ext3/balloc.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/balloc.c 2005-10-27 21:44:24.000000000 +0400 -+++ linux-2.6.9-full/fs/ext3/balloc.c 2005-12-16 23:16:42.000000000 +0300 +--- linux-stage.orig/fs/ext3/balloc.c 2006-05-25 10:36:02.000000000 -0600 ++++ linux-stage/fs/ext3/balloc.c 2006-05-25 10:36:04.000000000 -0600 @@ -79,7 +79,7 @@ struct ext3_group_desc * ext3_get_group_ * * Return buffer_head on success or NULL in case of failure. 
@@ -279,7 +295,7 @@ Index: linux-2.6.9-full/fs/ext3/balloc.c read_block_bitmap(struct super_block *sb, unsigned int block_group) { struct ext3_group_desc * desc; -@@ -450,24 +450,6 @@ error_return: +@@ -451,24 +451,6 @@ return; } @@ -304,7 +320,7 @@ Index: linux-2.6.9-full/fs/ext3/balloc.c /* * For ext3 allocations, we must not reuse any blocks which are * allocated in the bitmap buffer's "last committed data" copy. This -@@ -1140,7 +1122,7 @@ int ext3_should_retry_alloc(struct super +@@ -1131,7 +1113,7 @@ * bitmap, and then for any free bit if that fails. * This function also updates quota and i_blocks field. */ @@ -313,10 +329,10 @@ Index: linux-2.6.9-full/fs/ext3/balloc.c unsigned long goal, int *errp) { struct buffer_head *bitmap_bh = NULL; -Index: linux-2.6.9-full/fs/ext3/xattr.c +Index: linux-stage/fs/ext3/xattr.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/xattr.c 2005-12-16 23:16:40.000000000 +0300 -+++ linux-2.6.9-full/fs/ext3/xattr.c 2005-12-16 23:16:42.000000000 +0300 +--- linux-stage.orig/fs/ext3/xattr.c 2006-05-25 10:36:04.000000000 -0600 ++++ linux-stage/fs/ext3/xattr.c 2006-05-25 10:36:04.000000000 -0600 @@ -1281,7 +1281,7 @@ ext3_xattr_set_handle2(handle_t *handle, new_bh = sb_getblk(sb, block); if (!new_bh) { @@ -344,11 +360,11 @@ Index: linux-2.6.9-full/fs/ext3/xattr.c get_bh(bh); ext3_forget(handle, 1, inode, bh, EXT3_I(inode)->i_file_acl); } else { -Index: linux-2.6.9-full/fs/ext3/mballoc.c +Index: linux-stage/fs/ext3/mballoc.c =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/mballoc.c 2005-12-16 17:46:19.148560250 +0300 -+++ linux-2.6.9-full/fs/ext3/mballoc.c 2005-12-17 00:10:15.000000000 +0300 -@@ -0,0 +1,2429 @@ +--- linux-stage.orig/fs/ext3/mballoc.c 2006-05-23 17:33:37.579436680 -0600 ++++ linux-stage/fs/ext3/mballoc.c 2006-05-25 10:59:14.000000000 -0600 +@@ -0,0 +1,2701 @@ +/* + * Copyright (c) 2003-2005, Cluster File Systems, 
Inc, info@clusterfs.com + * Written by Alex Tomas @@ -437,6 +453,12 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + +long ext3_mb_stats = 1; + ++/* ++ * for which requests use 2^N search using buddies ++ */ ++long ext3_mb_order2_reqs = 8; ++ ++ +#ifdef EXT3_BB_MAX_BLOCKS +#undef EXT3_BB_MAX_BLOCKS +#endif @@ -477,10 +499,10 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + + /* search goals */ + struct ext3_free_extent ac_g_ex; -+ ++ + /* the best found extent */ + struct ext3_free_extent ac_b_ex; -+ ++ + /* number of iterations done. we have to track to limit searching */ + unsigned long ac_ex_scanned; + __u16 ac_groups_scanned; @@ -502,6 +524,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +struct ext3_mb_history { + struct ext3_free_extent goal; /* goal allocation */ + struct ext3_free_extent result; /* result allocation */ ++ unsigned pid; ++ unsigned ino; + __u16 found; /* how many extents have been found */ + __u16 groups; /* how many groups have been scanned */ + __u16 tail; /* what tail broke some buddy */ @@ -524,9 +548,9 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +#define EXT3_MB_BUDDY(e3b) ((e3b)->bd_buddy) + +#ifndef EXT3_MB_HISTORY -+#define ext3_mb_store_history(sb,ac) ++#define ext3_mb_store_history(sb,ino,ac) +#else -+static void ext3_mb_store_history(struct super_block *, ++static void ext3_mb_store_history(struct super_block *, unsigned ino, + struct ext3_allocation_context *ac); +#endif + @@ -645,7 +669,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + if (mb_check_counter++ % 300 != 0) + return; + } -+ ++ + while (order > 1) { + buddy = mb_find_buddy(e3b, order, &max); + J_ASSERT(buddy); @@ -826,7 +850,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + sb = inode->i_sb; + blocksize = 1 << inode->i_blkbits; + blocks_per_page = PAGE_CACHE_SIZE / blocksize; -+ ++ + groups_per_page = blocks_per_page >> 1; + if (groups_per_page == 0) + groups_per_page = 1; @@ -841,9 +865,9 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + memset(bh, 0, i); + } else + bh = 
&bhs; -+ ++ + first_group = page->index * blocks_per_page / 2; -+ ++ + /* read all groups the page covers into the cache */ + for (i = 0; i < groups_per_page; i++) { + struct ext3_group_desc * desc; @@ -898,11 +922,11 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + mb_debug("put buddy for group %u in page %lu/%x\n", + group, page->index, i * blocksize); + memset(data, 0xff, blocksize); -+ EXT3_SB(sb)->s_group_info[group]->bb_fragments = 0; -+ memset(EXT3_SB(sb)->s_group_info[group]->bb_counters, 0, ++ EXT3_GROUP_INFO(sb, group)->bb_fragments = 0; ++ memset(EXT3_GROUP_INFO(sb, group)->bb_counters, 0, + sizeof(unsigned short)*(sb->s_blocksize_bits+2)); + ext3_mb_generate_buddy(sb, data, bitmap, -+ EXT3_SB(sb)->s_group_info[group]); ++ EXT3_GROUP_INFO(sb, group)); + } else { + /* this is block of bitmap */ + mb_debug("put bitmap for group %u in page %lu/%x\n", @@ -935,7 +959,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; + + e3b->bd_blkbits = sb->s_blocksize_bits; -+ e3b->bd_info = sbi->s_group_info[group]; ++ e3b->bd_info = EXT3_GROUP_INFO(sb, group); + e3b->bd_sb = sb; + e3b->bd_group = group; + e3b->bd_buddy_page = NULL; @@ -1011,14 +1035,14 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +ext3_lock_group(struct super_block *sb, int group) +{ + bit_spin_lock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static inline void +ext3_unlock_group(struct super_block *sb, int group) +{ + bit_spin_unlock(EXT3_GROUP_INFO_LOCKED_BIT, -+ &EXT3_SB(sb)->s_group_info[group]->bb_state); ++ &EXT3_GROUP_INFO(sb, group)->bb_state); +} + +static int mb_find_order_for_block(struct ext3_buddy *e3b, int block) @@ -1148,7 +1172,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +static int mb_find_extent(struct ext3_buddy *e3b, int order, int block, + int needed, struct ext3_free_extent *ex) +{ -+ int next, max, ord; ++ int next = block, max, ord; + void *buddy; + 
+ J_ASSERT(ex != NULL); @@ -1173,6 +1197,11 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + ex->fe_start = block << order; + ex->fe_group = e3b->bd_group; + ++ /* calc difference from given start */ ++ next = next - ex->fe_start; ++ ex->fe_len -= next; ++ ex->fe_start += next; ++ + while (needed > ex->fe_len && (buddy = mb_find_buddy(e3b, order, &max))) { + + if (block + 1 >= max) @@ -1368,7 +1397,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ex.fe_start, ex.fe_len, &ex); -+ ++ + if (max > 0) { + ac->ac_b_ex = ex; + ext3_mb_use_best_found(ac, e3b); @@ -1385,6 +1414,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + struct ext3_buddy *e3b) +{ + int group = ac->ac_g_ex.fe_group, max, err; ++ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); ++ struct ext3_super_block *es = sbi->s_es; + struct ext3_free_extent ex; + + err = ext3_mb_load_buddy(ac->ac_sb, group, e3b); @@ -1393,9 +1424,27 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + + ext3_lock_group(ac->ac_sb, group); + max = mb_find_extent(e3b, 0, ac->ac_g_ex.fe_start, -+ ac->ac_g_ex.fe_len, &ex); -+ -+ if (max > 0) { ++ ac->ac_g_ex.fe_len, &ex); ++ ++ if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) { ++ unsigned long start; ++ start = (e3b->bd_group * EXT3_BLOCKS_PER_GROUP(ac->ac_sb) + ++ ex.fe_start + le32_to_cpu(es->s_first_data_block)); ++ if (start % sbi->s_stripe == 0) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } ++ } else if (max >= ac->ac_g_ex.fe_len) { ++ J_ASSERT(ex.fe_len > 0); ++ J_ASSERT(ex.fe_group == ac->ac_g_ex.fe_group); ++ J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ } else if (max > 0 && (ac->ac_flags & EXT3_MB_HINT_MERGE)) { ++ /* Sometimes, caller may want to merge even small ++ * number of blocks to an existing extent */ + J_ASSERT(ex.fe_len > 0); + J_ASSERT(ex.fe_group == 
ac->ac_g_ex.fe_group); + J_ASSERT(ex.fe_start == ac->ac_g_ex.fe_start); @@ -1423,7 +1472,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + int i, k, max; + + J_ASSERT(ac->ac_2order > 0); -+ for (i = ac->ac_2order; i < sb->s_blocksize_bits + 1; i++) { ++ for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) { + if (grp->bb_counters[i] == 0) + continue; + @@ -1488,11 +1537,46 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + } +} + ++/* ++ * This is a special case for storages like raid5 ++ * we try to find stripe-aligned chunks for stripe-size requests ++ */ ++static void ext3_mb_scan_aligned(struct ext3_allocation_context *ac, ++ struct ext3_buddy *e3b) ++{ ++ struct super_block *sb = ac->ac_sb; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ void *bitmap = EXT3_MB_BITMAP(e3b); ++ struct ext3_free_extent ex; ++ unsigned long i, max; ++ ++ J_ASSERT(sbi->s_stripe != 0); ++ ++ /* find first stripe-aligned block */ ++ i = e3b->bd_group * EXT3_BLOCKS_PER_GROUP(sb) ++ + le32_to_cpu(sbi->s_es->s_first_data_block); ++ i = ((i + sbi->s_stripe - 1) / sbi->s_stripe) * sbi->s_stripe; ++ i = (i - le32_to_cpu(sbi->s_es->s_first_data_block)) ++ % EXT3_BLOCKS_PER_GROUP(sb); ++ ++ while (i < sb->s_blocksize * 8) { ++ if (!mb_test_bit(i, bitmap)) { ++ max = mb_find_extent(e3b, 0, i, sbi->s_stripe, &ex); ++ if (max >= sbi->s_stripe) { ++ ac->ac_found++; ++ ac->ac_b_ex = ex; ++ ext3_mb_use_best_found(ac, e3b); ++ break; ++ } ++ } ++ i += sbi->s_stripe; ++ } ++} ++ +static int ext3_mb_good_group(struct ext3_allocation_context *ac, + int group, int cr) +{ -+ struct ext3_sb_info *sbi = EXT3_SB(ac->ac_sb); -+ struct ext3_group_info *grp = sbi->s_group_info[group]; ++ struct ext3_group_info *grp = EXT3_GROUP_INFO(ac->ac_sb, group); + unsigned free, fragments, i, bits; + + J_ASSERT(cr >= 0 && cr < 4); @@ -1509,15 +1593,18 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + case 0: + J_ASSERT(ac->ac_2order != 0); + bits = ac->ac_sb->s_blocksize_bits + 1; -+ for (i = ac->ac_2order; i < bits; i++) 
++ for (i = ac->ac_2order; i <= bits; i++) + if (grp->bb_counters[i] > 0) + return 1; ++ break; + case 1: + if ((free / fragments) >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 2: + if (free >= ac->ac_g_ex.fe_len) + return 1; ++ break; + case 3: + return 1; + default: @@ -1618,23 +1705,27 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + ac.ac_2order = 0; + ac.ac_criteria = 0; + ++ if (*len == 1 && sbi->s_stripe) { ++ /* looks like a metadata, let's use a dirty hack for raid5 ++ * move all metadata in first groups in hope to hit cached ++ * sectors and thus avoid read-modify cycles in raid5 */ ++ ac.ac_g_ex.fe_group = group = 0; ++ } ++ + /* probably, the request is for 2^8+ blocks (1/2/3/... MB) */ + i = ffs(*len); -+ if (i >= 8) { ++ if (i >= ext3_mb_order2_reqs) { + i--; + if ((*len & (~(1 << i))) == 0) + ac.ac_2order = i; + } + -+ /* Sometimes, caller may want to merge even small -+ * number of blocks to an existing extent */ -+ if (ac.ac_flags & EXT3_MB_HINT_MERGE) { -+ err = ext3_mb_find_by_goal(&ac, &e3b); -+ if (err) -+ goto out_err; -+ if (ac.ac_status == AC_STATUS_FOUND) -+ goto found; -+ } ++ /* first, try the goal */ ++ err = ext3_mb_find_by_goal(&ac, &e3b); ++ if (err) ++ goto out_err; ++ if (ac.ac_status == AC_STATUS_FOUND) ++ goto found; + + /* Let's just scan groups to find more-less suitable blocks */ + cr = ac.ac_2order ? 
0 : 1; @@ -1645,7 +1736,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + if (group == EXT3_SB(sb)->s_groups_count) + group = 0; + -+ if (EXT3_MB_GRP_NEED_INIT(sbi->s_group_info[group])) { ++ if (EXT3_MB_GRP_NEED_INIT(EXT3_GROUP_INFO(sb, group))) { + /* we need full data about the group + * to make a good selection */ + err = ext3_mb_load_buddy(ac.ac_sb, group, &e3b); @@ -1673,6 +1764,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + ac.ac_groups_scanned++; + if (cr == 0) + ext3_mb_simple_scan_group(&ac, &e3b); ++ else if (cr == 1 && *len == sbi->s_stripe) ++ ext3_mb_scan_aligned(&ac, &e3b); + else + ext3_mb_complex_scan_group(&ac, &e3b); + @@ -1686,7 +1779,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + } + + if (ac.ac_b_ex.fe_len > 0 && ac.ac_status != AC_STATUS_FOUND && -+ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { ++ !(ac.ac_flags & EXT3_MB_HINT_FIRST)) { + /* + * We've been searching too long. Let's try to allocate + * the best chunk we've found so far @@ -1731,8 +1824,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + sbi->s_blocks_reserved, ac.ac_found); + printk("EXT3-fs: groups: "); + for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) -+ printk("%d: %d ", i, -+ sbi->s_group_info[i]->bb_free); ++ printk("%d: %d ", i, EXT3_GROUP_INFO(sb, i)->bb_free); + printk("\n"); +#endif + goto out; @@ -1770,7 +1862,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + *errp = -EIO; + goto out_err; + } -+ ++ + err = ext3_journal_get_write_access(handle, gdp_bh); + if (err) + goto out_err; @@ -1839,7 +1931,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + * path only, here is single block always */ + ext3_mb_release_blocks(sb, 1); + } -+ ++ + if (unlikely(ext3_mb_stats) && ac.ac_g_ex.fe_len > 1) { + atomic_inc(&sbi->s_bal_reqs); + atomic_add(*len, &sbi->s_bal_allocated); @@ -1853,7 +1945,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + atomic_inc(&sbi->s_bal_breaks); + } + -+ ext3_mb_store_history(sb, &ac); ++ ext3_mb_store_history(sb, inode->i_ino, &ac); + + return block; +} @@ 
-1918,9 +2010,9 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + char buf[20], buf2[20]; + + if (v == SEQ_START_TOKEN) { -+ seq_printf(seq, "%-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", -+ "goal", "result", "found", "grps", "cr", "merge", -+ "tail", "broken"); ++ seq_printf(seq, "%-5s %-8s %-17s %-17s %-5s %-5s %-2s %-5s %-5s %-6s\n", ++ "pid", "inode", "goal", "result", "found", "grps", "cr", ++ "merge", "tail", "broken"); + return 0; + } + @@ -1928,9 +2020,9 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + hs->goal.fe_start, hs->goal.fe_len); + sprintf(buf2, "%u/%u/%u", hs->result.fe_group, + hs->result.fe_start, hs->result.fe_len); -+ seq_printf(seq, "%-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", buf, -+ buf2, hs->found, hs->groups, hs->cr, -+ hs->merged ? "M" : "", hs->tail, ++ seq_printf(seq, "%-5u %-8u %-17s %-17s %-5u %-5u %-2u %-5s %-5u %-6u\n", ++ hs->pid, hs->ino, buf, buf2, hs->found, hs->groups, ++ hs->cr, hs->merged ? "M" : "", hs->tail, + hs->buddy ? 1 << hs->buddy : 0); + return 0; +} @@ -1964,7 +2056,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + s->max = sbi->s_mb_history_max; + s->start = sbi->s_mb_history_cur % s->max; + spin_unlock(&sbi->s_mb_history_lock); -+ ++ + rc = seq_open(file, &ext3_mb_seq_history_ops); + if (rc == 0) { + struct seq_file *m = (struct seq_file *)file->private_data; @@ -1988,10 +2080,104 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + +static struct file_operations ext3_mb_seq_history_fops = { + .owner = THIS_MODULE, -+ .open = ext3_mb_seq_history_open, -+ .read = seq_read, -+ .llseek = seq_lseek, -+ .release = ext3_mb_seq_history_release, ++ .open = ext3_mb_seq_history_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = ext3_mb_seq_history_release, ++}; ++ ++static void *ext3_mb_seq_groups_start(struct seq_file *seq, loff_t *pos) ++{ ++ struct super_block *sb = seq->private; ++ struct ext3_sb_info *sbi = EXT3_SB(sb); ++ long group; ++ ++ if (*pos < 0 || *pos >= sbi->s_groups_count) ++ return NULL; ++ ++ group = 
*pos + 1;
++	return (void *) group;
++}
++
++static void *ext3_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
++{
++	struct super_block *sb = seq->private;
++	struct ext3_sb_info *sbi = EXT3_SB(sb);
++	long group;
++
++	++*pos;
++	if (*pos < 0 || *pos >= sbi->s_groups_count)
++		return NULL;
++	group = *pos + 1;
++	return (void *) group;
++}
++
++static int ext3_mb_seq_groups_show(struct seq_file *seq, void *v)
++{
++	struct super_block *sb = seq->private;
++	long group = (long) v, i;
++	struct sg {
++		struct ext3_group_info info;
++		unsigned short counters[16];
++	} sg;
++
++	group--;
++	if (group == 0)
++		seq_printf(seq, "#%-5s: %-5s %-5s %-5s [ %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n",
++			   "group", "free", "frags", "first", "2^0", "2^1", "2^2",
++			   "2^3", "2^4", "2^5", "2^6", "2^7", "2^8", "2^9", "2^10",
++			   "2^11", "2^12", "2^13");
++
++	i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) +
++		sizeof(struct ext3_group_info);
++	ext3_lock_group(sb, group);
++	memcpy(&sg, EXT3_GROUP_INFO(sb, group), i);
++	ext3_unlock_group(sb, group);
++
++	if (EXT3_MB_GRP_NEED_INIT(&sg.info))
++		return 0;
++
++	seq_printf(seq, "#%-5lu: %-5u %-5u %-5u [", group, sg.info.bb_free,
++		   sg.info.bb_fragments, sg.info.bb_first_free);
++	for (i = 0; i <= 13; i++)
++		seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ?
++ sg.info.bb_counters[i] : 0); ++ seq_printf(seq, " ]\n"); ++ ++ return 0; ++} ++ ++static void ext3_mb_seq_groups_stop(struct seq_file *seq, void *v) ++{ ++} ++ ++static struct seq_operations ext3_mb_seq_groups_ops = { ++ .start = ext3_mb_seq_groups_start, ++ .next = ext3_mb_seq_groups_next, ++ .stop = ext3_mb_seq_groups_stop, ++ .show = ext3_mb_seq_groups_show, ++}; ++ ++static int ext3_mb_seq_groups_open(struct inode *inode, struct file *file) ++{ ++ struct super_block *sb = PDE(inode)->data; ++ int rc; ++ ++ rc = seq_open(file, &ext3_mb_seq_groups_ops); ++ if (rc == 0) { ++ struct seq_file *m = (struct seq_file *)file->private_data; ++ m->private = sb; ++ } ++ return rc; ++ ++} ++ ++static struct file_operations ext3_mb_seq_groups_fops = { ++ .owner = THIS_MODULE, ++ .open = ext3_mb_seq_groups_open, ++ .read = seq_read, ++ .llseek = seq_lseek, ++ .release = seq_release, +}; + +static void ext3_mb_history_release(struct super_block *sb) @@ -2000,6 +2186,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + char name[64]; + + snprintf(name, sizeof(name) - 1, "%s", bdevname(sb->s_bdev, name)); ++ remove_proc_entry("mb_groups", sbi->s_mb_proc); + remove_proc_entry("mb_history", sbi->s_mb_proc); + remove_proc_entry(name, proc_root_ext3); + @@ -2022,6 +2209,11 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + p->proc_fops = &ext3_mb_seq_history_fops; + p->data = sb; + } ++ p = create_proc_entry("mb_groups", S_IRUGO, sbi->s_mb_proc); ++ if (p) { ++ p->proc_fops = &ext3_mb_seq_groups_fops; ++ p->data = sb; ++ } + } + + sbi->s_mb_history_max = 1000; @@ -2034,7 +2226,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +} + +static void -+ext3_mb_store_history(struct super_block *sb, struct ext3_allocation_context *ac) ++ext3_mb_store_history(struct super_block *sb, unsigned ino, ++ struct ext3_allocation_context *ac) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); + struct ext3_mb_history h; @@ -2042,6 +2235,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + if (likely(sbi->s_mb_history 
== NULL)) + return; + ++ h.pid = current->pid; ++ h.ino = ino; + h.goal = ac->ac_g_ex; + h.result = ac->ac_b_ex; + h.found = ac->ac_found; @@ -2069,21 +2264,40 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +int ext3_mb_init_backend(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i, len; -+ -+ len = sizeof(struct ext3_buddy_group_blocks *) * sbi->s_groups_count; -+ sbi->s_group_info = kmalloc(len, GFP_KERNEL); ++ int i, j, len, metalen; ++ int num_meta_group_infos = ++ (sbi->s_groups_count + EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ struct ext3_group_info **meta_group_info; ++ ++ /* An 8TB filesystem with 64-bit pointers requires a 4096 byte ++ * kmalloc. A 128kb malloc should suffice for a 256TB filesystem. ++ * So a two level scheme suffices for now. */ ++ sbi->s_group_info = kmalloc(sizeof(*sbi->s_group_info) * ++ num_meta_group_infos, GFP_KERNEL); + if (sbi->s_group_info == NULL) { -+ printk(KERN_ERR "EXT3-fs: can't allocate mem for buddy\n"); ++ printk(KERN_ERR "EXT3-fs: can't allocate buddy meta group\n"); + return -ENOMEM; + } -+ memset(sbi->s_group_info, 0, len); -+ + sbi->s_buddy_cache = new_inode(sb); + if (sbi->s_buddy_cache == NULL) { + printk(KERN_ERR "EXT3-fs: can't get new inode\n"); -+ kfree(sbi->s_group_info); -+ return -ENOMEM; ++ goto err_freesgi; ++ } ++ ++ metalen = sizeof(*meta_group_info) << EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) { ++ if ((i + 1) == num_meta_group_infos) ++ metalen = sizeof(*meta_group_info) * ++ (sbi->s_groups_count - ++ (i << EXT3_DESC_PER_BLOCK_BITS(sb))); ++ meta_group_info = kmalloc(metalen, GFP_KERNEL); ++ if (meta_group_info == NULL) { ++ printk(KERN_ERR "EXT3-fs: can't allocate mem for a " ++ "buddy group\n"); ++ goto err_freemeta; ++ } ++ sbi->s_group_info[i] = meta_group_info; + } + + /* @@ -2095,30 +2309,42 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + for (i = 0; i < sbi->s_groups_count; i++) { + struct ext3_group_desc * 
desc; + -+ sbi->s_group_info[i] = kmalloc(len, GFP_KERNEL); -+ if (sbi->s_group_info[i] == NULL) { ++ meta_group_info = ++ sbi->s_group_info[i >> EXT3_DESC_PER_BLOCK_BITS(sb)]; ++ j = i & (EXT3_DESC_PER_BLOCK(sb) - 1); ++ ++ meta_group_info[j] = kmalloc(len, GFP_KERNEL); ++ if (meta_group_info[j] == NULL) { + printk(KERN_ERR "EXT3-fs: can't allocate buddy mem\n"); -+ goto err_out; ++ i--; ++ goto err_freebuddy; + } + desc = ext3_get_group_desc(sb, i, NULL); + if (desc == NULL) { + printk(KERN_ERR"EXT3-fs: can't read descriptor %u\n",i); -+ goto err_out; ++ goto err_freebuddy; + } -+ memset(sbi->s_group_info[i], 0, len); ++ memset(meta_group_info[j], 0, len); + set_bit(EXT3_GROUP_INFO_NEED_INIT_BIT, -+ &sbi->s_group_info[i]->bb_state); -+ sbi->s_group_info[i]->bb_free = ++ &meta_group_info[j]->bb_state); ++ meta_group_info[j]->bb_free = + le16_to_cpu(desc->bg_free_blocks_count); + } + + return 0; + -+err_out: ++err_freebuddy: ++ while (i >= 0) { ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ i--; ++ } ++ i = num_meta_group_infos; ++err_freemeta: + while (--i >= 0) + kfree(sbi->s_group_info[i]); + iput(sbi->s_buddy_cache); -+ ++err_freesgi: ++ kfree(sbi->s_group_info); + return -ENOMEM; +} + @@ -2160,7 +2386,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + max = max >> 1; + i++; + } while (i <= sb->s_blocksize_bits + 1); -+ ++ + + /* init file for buddy data */ + if ((i = ext3_mb_init_backend(sb))) { @@ -2197,8 +2423,8 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c +int ext3_mb_release(struct super_block *sb) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); -+ int i; -+ ++ int i, num_meta_group_infos; ++ + if (!test_opt(sb, MBALLOC)) + return 0; + @@ -2212,11 +2438,13 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + ext3_mb_free_committed_blocks(sb); + + if (sbi->s_group_info) { -+ for (i = 0; i < sbi->s_groups_count; i++) { -+ if (sbi->s_group_info[i] == NULL) -+ continue; ++ for (i = 0; i < sbi->s_groups_count; i++) ++ kfree(EXT3_GROUP_INFO(sb, i)); ++ num_meta_group_infos = 
(sbi->s_groups_count + ++ EXT3_DESC_PER_BLOCK(sb) - 1) >> ++ EXT3_DESC_PER_BLOCK_BITS(sb); ++ for (i = 0; i < num_meta_group_infos; i++) + kfree(sbi->s_group_info[i]); -+ } + kfree(sbi->s_group_info); + } + if (sbi->s_mb_offsets) @@ -2510,7 +2738,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count); + spin_unlock(sb_bgl_lock(sbi, block_group)); + percpu_counter_mod(&sbi->s_freeblocks_counter, count); -+ ++ + ext3_mb_release_desc(&e3b); + + *freed = count; @@ -2593,10 +2821,11 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + return; +} + -+#define EXT3_ROOT "ext3" -+#define EXT3_MB_STATS_NAME "mb_stats" ++#define EXT3_ROOT "ext3" ++#define EXT3_MB_STATS_NAME "mb_stats" +#define EXT3_MB_MAX_TO_SCAN_NAME "mb_max_to_scan" +#define EXT3_MB_MIN_TO_SCAN_NAME "mb_min_to_scan" ++#define EXT3_MB_ORDER2_REQ "mb_order2_req" + +static int ext3_mb_stats_read(char *page, char **start, off_t off, + int count, int *eof, void *data) @@ -2684,6 +2913,45 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + return len; +} + ++static int ext3_mb_order2_req_write(struct file *file, const char *buffer, ++ unsigned long count, void *data) ++{ ++ char str[32]; ++ long value; ++ ++ if (count >= sizeof(str)) { ++ printk(KERN_ERR "EXT3-fs: %s string too long, max %u bytes\n", ++ EXT3_MB_MIN_TO_SCAN_NAME, (int)sizeof(str)); ++ return -EOVERFLOW; ++ } ++ ++ if (copy_from_user(str, buffer, count)) ++ return -EFAULT; ++ ++ /* Only set to 0 or 1 respectively; zero->0; non-zero->1 */ ++ value = simple_strtol(str, NULL, 0); ++ if (value <= 0) ++ return -ERANGE; ++ ++ ext3_mb_order2_reqs = value; ++ ++ return count; ++} ++ ++static int ext3_mb_order2_req_read(char *page, char **start, off_t off, ++ int count, int *eof, void *data) ++{ ++ int len; ++ ++ *eof = 1; ++ if (off != 0) ++ return 0; ++ ++ len = sprintf(page, "%ld\n", ext3_mb_order2_reqs); ++ *start = page; ++ return len; ++} ++ +static int ext3_mb_min_to_scan_write(struct file *file, const char 
*buffer, + unsigned long count, void *data) +{ @@ -2691,7 +2959,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + long value; + + if (count >= sizeof(str)) { -+ printk(KERN_ERR "EXT3: %s string too long, max %u bytes\n", ++ printk(KERN_ERR "EXT3-fs: %s string too long, max %u bytes\n", + EXT3_MB_MIN_TO_SCAN_NAME, (int)sizeof(str)); + return -EOVERFLOW; + } @@ -2714,10 +2982,11 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + struct proc_dir_entry *proc_ext3_mb_stats; + struct proc_dir_entry *proc_ext3_mb_max_to_scan; + struct proc_dir_entry *proc_ext3_mb_min_to_scan; ++ struct proc_dir_entry *proc_ext3_mb_order2_req; + + proc_root_ext3 = proc_mkdir(EXT3_ROOT, proc_root_fs); + if (proc_root_ext3 == NULL) { -+ printk(KERN_ERR "EXT3: Unable to create %s\n", EXT3_ROOT); ++ printk(KERN_ERR "EXT3-fs: Unable to create %s\n", EXT3_ROOT); + return -EIO; + } + @@ -2725,7 +2994,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + proc_ext3_mb_stats = create_proc_entry(EXT3_MB_STATS_NAME, + S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); + if (proc_ext3_mb_stats == NULL) { -+ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ printk(KERN_ERR "EXT3-fs: Unable to create %s\n", + EXT3_MB_STATS_NAME); + remove_proc_entry(EXT3_ROOT, proc_root_fs); + return -EIO; @@ -2740,7 +3009,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + EXT3_MB_MAX_TO_SCAN_NAME, + S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); + if (proc_ext3_mb_max_to_scan == NULL) { -+ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ printk(KERN_ERR "EXT3-fs: Unable to create %s\n", + EXT3_MB_MAX_TO_SCAN_NAME); + remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); + remove_proc_entry(EXT3_ROOT, proc_root_fs); @@ -2756,7 +3025,7 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + EXT3_MB_MIN_TO_SCAN_NAME, + S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); + if (proc_ext3_mb_min_to_scan == NULL) { -+ printk(KERN_ERR "EXT3: Unable to create %s\n", ++ printk(KERN_ERR "EXT3-fs: Unable to create %s\n", + EXT3_MB_MIN_TO_SCAN_NAME); + 
remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); @@ -2768,6 +3037,24 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + proc_ext3_mb_min_to_scan->read_proc = ext3_mb_min_to_scan_read; + proc_ext3_mb_min_to_scan->write_proc = ext3_mb_min_to_scan_write; + ++ /* Initialize EXT3_ORDER2_REQ */ ++ proc_ext3_mb_order2_req = create_proc_entry( ++ EXT3_MB_ORDER2_REQ, ++ S_IFREG | S_IRUGO | S_IWUSR, proc_root_ext3); ++ if (proc_ext3_mb_order2_req == NULL) { ++ printk(KERN_ERR "EXT3-fs: Unable to create %s\n", ++ EXT3_MB_ORDER2_REQ); ++ remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_ROOT, proc_root_fs); ++ return -EIO; ++ } ++ ++ proc_ext3_mb_order2_req->data = NULL; ++ proc_ext3_mb_order2_req->read_proc = ext3_mb_order2_req_read; ++ proc_ext3_mb_order2_req->write_proc = ext3_mb_order2_req_write; ++ + return 0; +} + @@ -2776,12 +3063,13 @@ Index: linux-2.6.9-full/fs/ext3/mballoc.c + remove_proc_entry(EXT3_MB_STATS_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MAX_TO_SCAN_NAME, proc_root_ext3); + remove_proc_entry(EXT3_MB_MIN_TO_SCAN_NAME, proc_root_ext3); ++ remove_proc_entry(EXT3_MB_ORDER2_REQ, proc_root_ext3); + remove_proc_entry(EXT3_ROOT, proc_root_fs); +} -Index: linux-2.6.9-full/fs/ext3/Makefile +Index: linux-stage/fs/ext3/Makefile =================================================================== ---- linux-2.6.9-full.orig/fs/ext3/Makefile 2005-12-16 23:16:41.000000000 +0300 -+++ linux-2.6.9-full/fs/ext3/Makefile 2005-12-16 23:16:42.000000000 +0300 +--- linux-stage.orig/fs/ext3/Makefile 2006-05-25 10:36:04.000000000 -0600 ++++ linux-stage/fs/ext3/Makefile 2006-05-25 10:36:04.000000000 -0600 @@ -6,7 +6,7 @@ ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ diff --git 
a/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.12.patch b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.12.patch new file mode 100644 index 0000000..ef0f4a4 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.12.patch @@ -0,0 +1,64 @@ +Subject: Avoid disk sector_t overflow for >2TB ext3 filesystem +From: Mingming Cao + + +If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e. +CONFIG_LBD not defined in the kernel), the calculation of the disk sector +will overflow. Add check at ext3_fill_super() and ext3_group_extend() to +prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4 +bytes. + +Verified this patch on a 32 bit platform without CONFIG_LBD defined +(sector_t is 32 bits long), mount refuse to mount a 10TB ext3. + +Signed-off-by: Mingming Cao +Acked-by: Andreas Dilger +Signed-off-by: Andrew Morton +--- + + fs/ext3/resize.c | 10 ++++++++++ + fs/ext3/super.c | 10 ++++++++++ + 2 files changed, 20 insertions(+) + +diff -puN fs/ext3/resize.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem fs/ext3/resize.c +--- devel/fs/ext3/resize.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem 2006-05-22 14:09:53.000000000 -0700 ++++ devel-akpm/fs/ext3/resize.c 2006-05-22 14:10:56.000000000 -0700 +@@ -926,6 +926,16 @@ int ext3_group_extend(struct super_block + if (n_blocks_count == 0 || n_blocks_count == o_blocks_count) + return 0; + ++ if (n_blocks_count > (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) { ++ printk(KERN_ERR "EXT3-fs: filesystem on %s: " ++ "too large to resize to %lu blocks safely\n", ++ sb->s_id, n_blocks_count); ++ if (sizeof(sector_t) < 8) ++ ext3_warning(sb, __FUNCTION__, ++ "CONFIG_LBD not enabled\n"); ++ return -EINVAL; ++ } ++ + if (n_blocks_count < o_blocks_count) { + ext3_warning(sb, __FUNCTION__, + "can't shrink FS - resize aborted"); +diff -puN fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem fs/ext3/super.c +--- 
devel/fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem 2006-05-22 14:09:53.000000000 -0700 ++++ devel-akpm/fs/ext3/super.c 2006-05-22 14:11:10.000000000 -0700 +@@ -1565,6 +1565,17 @@ static int ext3_fill_super (struct super + goto failed_mount; + } + ++ if (le32_to_cpu(es->s_blocks_count) > ++ (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) { ++ printk(KERN_ERR "EXT3-fs: filesystem on %s: " ++ "too large to mount safely - %u blocks\n", sb->s_id, ++ le32_to_cpu(es->s_blocks_count)); ++ if (sizeof(sector_t) < 8) ++ printk(KERN_WARNING ++ "EXT3-fs: CONFIG_LBD not enabled\n"); ++ goto failed_mount; ++ } ++ + if (EXT3_BLOCKS_PER_GROUP(sb) == 0) + goto cantfind_ext3; + sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) - +_ diff --git a/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.5-suse.patch b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.5-suse.patch new file mode 100644 index 0000000..fe655da --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.5-suse.patch @@ -0,0 +1,44 @@ +Subject: Avoid disk sector_t overflow for >2TB ext3 filesystem +From: Mingming Cao + + +If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e. +CONFIG_LBD not defined in the kernel), the calculation of the disk sector +will overflow. Add check at ext3_fill_super() and ext3_group_extend() to +prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4 +bytes. + +Verified this patch on a 32 bit platform without CONFIG_LBD defined +(sector_t is 32 bits long), mount refuse to mount a 10TB ext3. 
+ +Signed-off-by: Mingming Cao +Acked-by: Andreas Dilger +Signed-off-by: Andrew Morton +--- + + fs/ext3/resize.c | 10 ++++++++++ + fs/ext3/super.c | 10 ++++++++++ + 2 files changed, 20 insertions(+) + +diff -puN fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem fs/ext3/super.c +--- devel/fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem 2006-05-22 14:09:53.000000000 -0700 ++++ devel-akpm/fs/ext3/super.c 2006-05-22 14:11:10.000000000 -0700 +@@ -1565,6 +1565,17 @@ static int ext3_fill_super (struct super + goto failed_mount; + } + ++ if (le32_to_cpu(es->s_blocks_count) > ++ (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) { ++ printk(KERN_ERR "EXT3-fs: filesystem on %s: " ++ "too large to mount safely - %u blocks\n", sb->s_id, ++ le32_to_cpu(es->s_blocks_count)); ++ if (sizeof(sector_t) < 8) ++ printk(KERN_WARNING ++ "EXT3-fs: CONFIG_LBD not enabled\n"); ++ goto failed_mount; ++ } ++ + sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) - + le32_to_cpu(es->s_first_data_block) + + EXT3_BLOCKS_PER_GROUP(sb) - 1) / +_ diff --git a/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.9-rhel4.patch b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.9-rhel4.patch new file mode 100644 index 0000000..9bfdf80 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/ext3-sector_t-overflow-2.6.9-rhel4.patch @@ -0,0 +1,64 @@ +Subject: Avoid disk sector_t overflow for >2TB ext3 filesystem +From: Mingming Cao + + +If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e. +CONFIG_LBD not defined in the kernel), the calculation of the disk sector +will overflow. Add check at ext3_fill_super() and ext3_group_extend() to +prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4 +bytes. + +Verified this patch on a 32 bit platform without CONFIG_LBD defined +(sector_t is 32 bits long), mount refuse to mount a 10TB ext3. 
+ +Signed-off-by: Mingming Cao +Acked-by: Andreas Dilger +Signed-off-by: Andrew Morton +--- + + fs/ext3/resize.c | 10 ++++++++++ + fs/ext3/super.c | 10 ++++++++++ + 2 files changed, 20 insertions(+) + +diff -puN fs/ext3/resize.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem fs/ext3/resize.c +--- devel/fs/ext3/resize.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem 2006-05-22 14:09:53.000000000 -0700 ++++ devel-akpm/fs/ext3/resize.c 2006-05-22 14:10:56.000000000 -0700 +@@ -926,6 +926,16 @@ int ext3_group_extend(struct super_block + if (n_blocks_count == 0 || n_blocks_count == o_blocks_count) + return 0; + ++ if (n_blocks_count > (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) { ++ printk(KERN_ERR "EXT3-fs: filesystem on %s: " ++ "too large to resize to %lu blocks safely\n", ++ sb->s_id, n_blocks_count); ++ if (sizeof(sector_t) < 8) ++ ext3_warning(sb, __FUNCTION__, ++ "CONFIG_LBD not enabled\n"); ++ return -EINVAL; ++ } ++ + if (n_blocks_count < o_blocks_count) { + ext3_warning(sb, __FUNCTION__, + "can't shrink FS - resize aborted"); +diff -puN fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem fs/ext3/super.c +--- devel/fs/ext3/super.c~avoid-disk-sector_t-overflow-for-2tb-ext3-filesystem 2006-05-22 14:09:53.000000000 -0700 ++++ devel-akpm/fs/ext3/super.c 2006-05-22 14:11:10.000000000 -0700 +@@ -1565,6 +1565,17 @@ static int ext3_fill_super (struct super + goto failed_mount; + } + ++ if (le32_to_cpu(es->s_blocks_count) > ++ (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) { ++ printk(KERN_ERR "EXT3-fs: filesystem on %s: " ++ "too large to mount safely - %u blocks\n", sb->s_id, ++ le32_to_cpu(es->s_blocks_count)); ++ if (sizeof(sector_t) < 8) ++ printk(KERN_WARNING ++ "EXT3-fs: CONFIG_LBD not enabled\n"); ++ goto failed_mount; ++ } ++ + sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) - + le32_to_cpu(es->s_first_data_block) + + EXT3_BLOCKS_PER_GROUP(sb) - 1) / +_ diff --git 
a/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-rhel4.patch b/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-rhel4.patch index 1c5c6ab..b586a2f 100644 --- a/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-rhel4.patch +++ b/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-rhel4.patch @@ -19,16 +19,19 @@ Index: uml-2.6.3/fs/ext3/ialloc.c { struct super_block *sb; struct buffer_head *bitmap_bh = NULL; -@@ -448,6 +449,38 @@ +@@ -448,6 +449,41 @@ sbi = EXT3_SB(sb); es = sbi->s_es; + if (goal) { + group = (goal - 1) / EXT3_INODES_PER_GROUP(sb); + ino = (goal - 1) % EXT3_INODES_PER_GROUP(sb); ++ err = -EIO; ++ + gdp = ext3_get_group_desc(sb, group, &bh2); ++ if (!gdp) ++ goto fail; + -+ err = -EIO; + bitmap_bh = read_inode_bitmap (sb, group); + if (!bitmap_bh) + goto fail; diff --git a/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-suse.patch b/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-suse.patch index a4867a5..33535dc 100644 --- a/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-suse.patch +++ b/ldiskfs/kernel_patches/patches/ext3-wantedi-2.6-suse.patch @@ -19,16 +19,19 @@ Index: uml-2.6.3/fs/ext3/ialloc.c { struct super_block *sb; struct buffer_head *bitmap_bh = NULL; -@@ -448,6 +449,38 @@ +@@ -448,6 +449,41 @@ sbi = EXT3_SB(sb); es = sbi->s_es; + if (goal) { + group = (goal - 1) / EXT3_INODES_PER_GROUP(sb); + ino = (goal - 1) % EXT3_INODES_PER_GROUP(sb); ++ err = -EIO; ++ + gdp = ext3_get_group_desc(sb, group, &bh2); ++ if (!gdp) ++ goto fail; + -+ err = -EIO; + bitmap_bh = read_inode_bitmap (sb, group); + if (!bitmap_bh) + goto fail; diff --git a/ldiskfs/kernel_patches/patches/iopen-2.6-fc5.patch b/ldiskfs/kernel_patches/patches/iopen-2.6-fc5.patch new file mode 100644 index 0000000..6bbcec5 --- /dev/null +++ b/ldiskfs/kernel_patches/patches/iopen-2.6-fc5.patch @@ -0,0 +1,448 @@ +Index: linux-2.6.16.i686/fs/ext3/iopen.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/iopen.c 2006-05-31 04:14:15.752410384 
+0800 ++++ linux-2.6.16.i686/fs/ext3/iopen.c 2006-05-30 22:52:38.000000000 +0800 +@@ -0,0 +1,259 @@ ++/* ++ * linux/fs/ext3/iopen.c ++ * ++ * Special support for open by inode number ++ * ++ * Copyright (C) 2001 by Theodore Ts'o (tytso@alum.mit.edu). ++ * ++ * This file may be redistributed under the terms of the GNU General ++ * Public License. ++ * ++ * ++ * Invariants: ++ * - there is only ever a single DCACHE_NFSD_DISCONNECTED dentry alias ++ * for an inode at one time. ++ * - there are never both connected and DCACHE_NFSD_DISCONNECTED dentry ++ * aliases on an inode at the same time. ++ * ++ * If we have any connected dentry aliases for an inode, use one of those ++ * in iopen_lookup(). Otherwise, we instantiate a single NFSD_DISCONNECTED ++ * dentry for this inode, which thereafter will be found by the dcache ++ * when looking up this inode number in __iopen__, so we don't return here ++ * until it is gone. ++ * ++ * If we get an inode via a regular name lookup, then we "rename" the ++ * NFSD_DISCONNECTED dentry to the proper name and parent. This ensures ++ * existing users of the disconnected dentry will continue to use the same ++ * dentry as the connected users, and there will never be both kinds of ++ * dentry aliases at one time. ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include "iopen.h" ++ ++#ifndef assert ++#define assert(test) J_ASSERT(test) ++#endif ++ ++#define IOPEN_NAME_LEN 32 ++ ++/* ++ * This implements looking up an inode by number. 
++ */ ++static struct dentry *iopen_lookup(struct inode * dir, struct dentry *dentry, ++ struct nameidata *nd) ++{ ++ struct inode *inode; ++ unsigned long ino; ++ struct list_head *lp; ++ struct dentry *alternate; ++ char buf[IOPEN_NAME_LEN]; ++ ++ if (dentry->d_name.len >= IOPEN_NAME_LEN) ++ return ERR_PTR(-ENAMETOOLONG); ++ ++ memcpy(buf, dentry->d_name.name, dentry->d_name.len); ++ buf[dentry->d_name.len] = 0; ++ ++ if (strcmp(buf, ".") == 0) ++ ino = dir->i_ino; ++ else if (strcmp(buf, "..") == 0) ++ ino = EXT3_ROOT_INO; ++ else ++ ino = simple_strtoul(buf, 0, 0); ++ ++ if ((ino != EXT3_ROOT_INO && ++ //ino != EXT3_ACL_IDX_INO && ++ //ino != EXT3_ACL_DATA_INO && ++ ino < EXT3_FIRST_INO(dir->i_sb)) || ++ ino > le32_to_cpu(EXT3_SB(dir->i_sb)->s_es->s_inodes_count)) ++ return ERR_PTR(-ENOENT); ++ ++ inode = iget(dir->i_sb, ino); ++ if (!inode) ++ return ERR_PTR(-EACCES); ++ if (is_bad_inode(inode)) { ++ iput(inode); ++ return ERR_PTR(-ENOENT); ++ } ++ ++ assert(list_empty(&dentry->d_alias)); /* d_instantiate */ ++ assert(d_unhashed(dentry)); /* d_rehash */ ++ ++ /* preferrably return a connected dentry */ ++ spin_lock(&dcache_lock); ++ list_for_each(lp, &inode->i_dentry) { ++ alternate = list_entry(lp, struct dentry, d_alias); ++ assert(!(alternate->d_flags & DCACHE_DISCONNECTED)); ++ } ++ ++ if (!list_empty(&inode->i_dentry)) { ++ alternate = list_entry(inode->i_dentry.next, ++ struct dentry, d_alias); ++ dget_locked(alternate); ++ spin_lock(&alternate->d_lock); ++ alternate->d_flags |= DCACHE_REFERENCED; ++ spin_unlock(&alternate->d_lock); ++ iput(inode); ++ spin_unlock(&dcache_lock); ++ return alternate; ++ } ++ dentry->d_flags |= DCACHE_DISCONNECTED; ++ ++ /* d_add(), but don't drop dcache_lock before adding dentry to inode */ ++ list_add(&dentry->d_alias, &inode->i_dentry); /* d_instantiate */ ++ dentry->d_inode = inode; ++ spin_unlock(&dcache_lock); ++ ++ d_rehash(dentry); ++ ++ return NULL; ++} ++ ++/* This function is spliced into ext3_lookup and does the 
move of a ++ * disconnected dentry (if it exists) to a connected dentry. ++ */ ++struct dentry *iopen_connect_dentry(struct dentry *dentry, struct inode *inode, ++ int rehash) ++{ ++ struct dentry *tmp, *goal = NULL; ++ struct list_head *lp; ++ ++ /* verify this dentry is really new */ ++ assert(dentry->d_inode == NULL); ++ assert(list_empty(&dentry->d_alias)); /* d_instantiate */ ++ if (rehash) ++ assert(d_unhashed(dentry)); /* d_rehash */ ++ assert(list_empty(&dentry->d_subdirs)); ++ ++ spin_lock(&dcache_lock); ++ if (!inode) ++ goto do_rehash; ++ ++ if (!test_opt(inode->i_sb, IOPEN)) ++ goto do_instantiate; ++ ++ /* preferrably return a connected dentry */ ++ list_for_each(lp, &inode->i_dentry) { ++ tmp = list_entry(lp, struct dentry, d_alias); ++ if (tmp->d_flags & DCACHE_DISCONNECTED) { ++ assert(tmp->d_alias.next == &inode->i_dentry); ++ assert(tmp->d_alias.prev == &inode->i_dentry); ++ goal = tmp; ++ dget_locked(goal); ++ break; ++ } ++ } ++ ++ if (!goal) ++ goto do_instantiate; ++ ++ /* Move the goal to the de hash queue */ ++ goal->d_flags &= ~DCACHE_DISCONNECTED; ++ security_d_instantiate(goal, inode); ++ __d_drop(dentry); ++ spin_unlock(&dcache_lock); ++ d_rehash(dentry); ++ d_move(goal, dentry); ++ iput(inode); ++ ++ return goal; ++ ++ /* d_add(), but don't drop dcache_lock before adding dentry to inode */ ++do_instantiate: ++ list_add(&dentry->d_alias, &inode->i_dentry); /* d_instantiate */ ++ dentry->d_inode = inode; ++do_rehash: ++ spin_unlock(&dcache_lock); ++ if (rehash) ++ d_rehash(dentry); ++ ++ return NULL; ++} ++ ++/* ++ * These are the special structures for the iopen pseudo directory. 
++ */ ++ ++static struct inode_operations iopen_inode_operations = { ++ lookup: iopen_lookup, /* BKL held */ ++}; ++ ++static struct file_operations iopen_file_operations = { ++ read: generic_read_dir, ++}; ++ ++static int match_dentry(struct dentry *dentry, const char *name) ++{ ++ int len; ++ ++ len = strlen(name); ++ if (dentry->d_name.len != len) ++ return 0; ++ if (strncmp(dentry->d_name.name, name, len)) ++ return 0; ++ return 1; ++} ++ ++/* ++ * This function is spliced into ext3_lookup and returns 1 if the file ++ * name is __iopen__ and the dentry has been filled in appropriately. ++ */ ++int ext3_check_for_iopen(struct inode *dir, struct dentry *dentry) ++{ ++ struct inode *inode; ++ ++ if (dir->i_ino != EXT3_ROOT_INO || ++ !test_opt(dir->i_sb, IOPEN) || ++ !match_dentry(dentry, "__iopen__")) ++ return 0; ++ ++ inode = iget(dir->i_sb, EXT3_BAD_INO); ++ ++ if (!inode) ++ return 0; ++ d_add(dentry, inode); ++ return 1; ++} ++ ++/* ++ * This function is spliced into read_inode; it returns 1 if the inode ++ * number is the one for /__iopen__, in which case the inode is filled ++ * in appropriately. Otherwise, this function returns 0. 
++ */ ++int ext3_iopen_get_inode(struct inode *inode) ++{ ++ if (inode->i_ino != EXT3_BAD_INO) ++ return 0; ++ ++ inode->i_mode = S_IFDIR | S_IRUSR | S_IXUSR; ++ if (test_opt(inode->i_sb, IOPEN_NOPRIV)) ++ inode->i_mode |= 0777; ++ inode->i_uid = 0; ++ inode->i_gid = 0; ++ inode->i_nlink = 1; ++ inode->i_size = 4096; ++ inode->i_atime = CURRENT_TIME; ++ inode->i_ctime = CURRENT_TIME; ++ inode->i_mtime = CURRENT_TIME; ++ EXT3_I(inode)->i_dtime = 0; ++ inode->i_blksize = PAGE_SIZE; /* This is the optimal IO size ++ * (for stat), not the fs block ++ * size */ ++ inode->i_blocks = 0; ++ inode->i_version = 1; ++ inode->i_generation = 0; ++ ++ inode->i_op = &iopen_inode_operations; ++ inode->i_fop = &iopen_file_operations; ++ inode->i_mapping->a_ops = 0; ++ ++ return 1; ++} +Index: linux-2.6.16.i686/fs/ext3/iopen.h +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/iopen.h 2006-05-31 04:14:15.752410384 +0800 ++++ linux-2.6.16.i686/fs/ext3/iopen.h 2006-05-30 22:52:38.000000000 +0800 +@@ -0,0 +1,15 @@ ++/* ++ * iopen.h ++ * ++ * Special support for opening files by inode number. ++ * ++ * Copyright (C) 2001 by Theodore Ts'o (tytso@alum.mit.edu). ++ * ++ * This file may be redistributed under the terms of the GNU General ++ * Public License. 
++ */ ++ ++extern int ext3_check_for_iopen(struct inode *dir, struct dentry *dentry); ++extern int ext3_iopen_get_inode(struct inode *inode); ++extern struct dentry *iopen_connect_dentry(struct dentry *dentry, ++ struct inode *inode, int rehash); +Index: linux-2.6.16.i686/fs/ext3/inode.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/inode.c 2006-05-30 22:52:03.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/inode.c 2006-05-30 22:52:38.000000000 +0800 +@@ -37,6 +37,7 @@ + #include + #include + #include "xattr.h" ++#include "iopen.h" + #include "acl.h" + + static int ext3_writepage_trans_blocks(struct inode *inode); +@@ -2448,6 +2449,8 @@ + ei->i_default_acl = EXT3_ACL_NOT_CACHED; + #endif + ei->i_block_alloc_info = NULL; ++ if (ext3_iopen_get_inode(inode)) ++ return; + + if (__ext3_get_inode_loc(inode, &iloc, 0)) + goto bad_inode; +Index: linux-2.6.16.i686/fs/ext3/super.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/super.c 2006-05-30 22:52:03.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/super.c 2006-05-30 22:52:38.000000000 +0800 +@@ -634,6 +634,7 @@ + Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, + Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, + Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, ++ Opt_iopen, Opt_noiopen, Opt_iopen_nopriv, + Opt_grpquota + }; + +@@ -682,6 +683,9 @@ + {Opt_noquota, "noquota"}, + {Opt_quota, "quota"}, + {Opt_usrquota, "usrquota"}, ++ {Opt_iopen, "iopen"}, ++ {Opt_noiopen, "noiopen"}, ++ {Opt_iopen_nopriv, "iopen_nopriv"}, + {Opt_barrier, "barrier=%u"}, + {Opt_err, NULL}, + {Opt_resize, "resize"}, +@@ -996,6 +1000,18 @@ + else + clear_opt(sbi->s_mount_opt, BARRIER); + break; ++ case Opt_iopen: ++ set_opt (sbi->s_mount_opt, IOPEN); ++ clear_opt (sbi->s_mount_opt, IOPEN_NOPRIV); ++ break; ++ case Opt_noiopen: ++ clear_opt (sbi->s_mount_opt, IOPEN); ++ clear_opt 
(sbi->s_mount_opt, IOPEN_NOPRIV); ++ break; ++ case Opt_iopen_nopriv: ++ set_opt (sbi->s_mount_opt, IOPEN); ++ set_opt (sbi->s_mount_opt, IOPEN_NOPRIV); ++ break; + case Opt_ignore: + break; + case Opt_resize: +Index: linux-2.6.16.i686/fs/ext3/namei.c +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/namei.c 2006-05-30 22:52:00.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/namei.c 2006-05-30 22:55:19.000000000 +0800 +@@ -39,6 +39,7 @@ + + #include "namei.h" + #include "xattr.h" ++#include "iopen.h" + #include "acl.h" + + /* +@@ -995,6 +996,9 @@ + if (dentry->d_name.len > EXT3_NAME_LEN) + return ERR_PTR(-ENAMETOOLONG); + ++ if (ext3_check_for_iopen(dir, dentry)) ++ return NULL; ++ + bh = ext3_find_entry(dentry, &de); + inode = NULL; + if (bh) { +@@ -1005,7 +1009,7 @@ + if (!inode) + return ERR_PTR(-EACCES); + } +- return d_splice_alias(inode, dentry); ++ return iopen_connect_dentry(dentry, inode, 1); + } + + +@@ -2046,10 +2050,6 @@ + inode->i_nlink); + inode->i_version++; + inode->i_nlink = 0; +- /* There's no need to set i_disksize: the fact that i_nlink is +- * zero will ensure that the right thing happens during any +- * recovery. 
*/ +- inode->i_size = 0; + ext3_orphan_add(handle, inode); + inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC; + ext3_mark_inode_dirty(handle, inode); +@@ -2173,6 +2173,23 @@ + return err; + } + ++/* Like ext3_add_nondir() except for call to iopen_connect_dentry */ ++static int ext3_add_link(handle_t *handle, struct dentry *dentry, ++ struct inode *inode) ++{ ++ int err = ext3_add_entry(handle, dentry, inode); ++ if (!err) { ++ err = ext3_mark_inode_dirty(handle, inode); ++ if (err == 0) { ++ dput(iopen_connect_dentry(dentry, inode, 0)); ++ return 0; ++ } ++ } ++ ext3_dec_count(handle, inode); ++ iput(inode); ++ return err; ++} ++ + static int ext3_link (struct dentry * old_dentry, + struct inode * dir, struct dentry *dentry) + { +@@ -2196,7 +2213,8 @@ + ext3_inc_count(handle, inode); + atomic_inc(&inode->i_count); + +- err = ext3_add_nondir(handle, dentry, inode); ++ err = ext3_add_link(handle, dentry, inode); ++ ext3_orphan_del(handle, inode); + ext3_journal_stop(handle); + if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries)) + goto retry; +Index: linux-2.6.16.i686/fs/ext3/Makefile +=================================================================== +--- linux-2.6.16.i686.orig/fs/ext3/Makefile 2006-03-20 13:53:29.000000000 +0800 ++++ linux-2.6.16.i686/fs/ext3/Makefile 2006-05-30 22:52:38.000000000 +0800 +@@ -4,7 +4,7 @@ + + obj-$(CONFIG_EXT3_FS) += ext3.o + +-ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ ++ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o iopen.o \ + ioctl.o namei.o super.o symlink.o hash.o resize.o + + ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o +Index: linux-2.6.16.i686/include/linux/ext3_fs.h +=================================================================== +--- linux-2.6.16.i686.orig/include/linux/ext3_fs.h 2006-05-30 22:52:00.000000000 +0800 ++++ linux-2.6.16.i686/include/linux/ext3_fs.h 2006-05-30 22:52:38.000000000 +0800 +@@ -375,6 +375,8 @@ + 
#define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */ + #define EXT3_MOUNT_USRQUOTA 0x100000 /* "old" user quota */ + #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */ ++#define EXT3_MOUNT_IOPEN 0x400000 /* Allow access via iopen */ ++#define EXT3_MOUNT_IOPEN_NOPRIV 0x800000/* Make iopen world-readable */ + + /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ + #ifndef _LINUX_EXT2_FS_H diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc3.series b/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc3.series new file mode 100644 index 0000000..5395486 --- /dev/null +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc3.series @@ -0,0 +1,22 @@ +ext3-wantedi-2.6-rhel4.patch +ext3-san-jdike-2.6-suse.patch +iopen-2.6-rhel4.patch +export_symbols-ext3-2.6-suse.patch +ext3-map_inode_page-2.6-suse.patch +ext3-ea-in-inode-2.6-rhel4.patch +export-ext3-2.6-rhel4.patch +ext3-include-fixes-2.6-rhel4.patch +ext3-extents-2.6.9-rhel4.patch +ext3-mballoc2-2.6.9-rhel4.patch +ext3-nlinks-2.6.9.patch +ext3-ialloc-2.6.patch +ext3-lookup-dotdot-2.6.9.patch +ext3-tall-htree.patch +ext3-htree-path.patch +ext3-htree-r5-hash.patch +ext3-htree-path-ops.patch +ext3-hash-selection.patch +ext3-htree-comments.patch +ext3-iam-ops.patch +ext3-iam-separate.patch +ext3-iam-uapi.patch diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc5.series b/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc5.series new file mode 100644 index 0000000..1c853bd --- /dev/null +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6-fc5.series @@ -0,0 +1,12 @@ +ext3-wantedi-2.6-rhel4.patch +ext3-san-jdike-2.6-suse.patch +iopen-2.6-fc5.patch +ext3-map_inode_page-2.6-suse.patch +export-ext3-2.6-rhel4.patch +ext3-include-fixes-2.6-rhel4.patch +ext3-extents-2.6.15.patch +ext3-mballoc2-2.6-fc5.patch +ext3-nlinks-2.6.9.patch +ext3-ialloc-2.6.patch +ext3-remove-cond_resched-calls-2.6.12.patch +ext3-filterdata-2.6.15.patch diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel4.series 
b/ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel4.series index f6d42cd..2f2f413 100644 --- a/ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel4.series +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel4.series @@ -16,6 +16,9 @@ ext3-htree-r5-hash.patch ext3-htree-path-ops.patch ext3-hash-selection.patch ext3-htree-comments.patch +ext3-lookup-dotdot-2.6.9.patch +ext3-sector_t-overflow-2.6.9-rhel4.patch +ext3-check-jbd-errors-2.6.9.patch ext3-iam-ops.patch ext3-iam-separate.patch ext3-iam-uapi.patch diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6-suse.series b/ldiskfs/kernel_patches/series/ldiskfs-2.6-suse.series index 2584c1d..f3be0ea 100644 --- a/ldiskfs/kernel_patches/series/ldiskfs-2.6-suse.series +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6-suse.series @@ -12,3 +12,6 @@ ext3-nlinks-2.6.7.patch ext3-rename-reserve-2.6-suse.patch ext3-htree-dot-2.6.5-suse.patch ext3-ialloc-2.6.patch +ext3-lookup-dotdot-2.6.9.patch +ext3-sector_t-overflow-2.6.5-suse.patch +ext3-check-jbd-errors-2.6.5.patch diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6.12-vanilla.series b/ldiskfs/kernel_patches/series/ldiskfs-2.6.12-vanilla.series index 7d0a383..53c060b 100644 --- a/ldiskfs/kernel_patches/series/ldiskfs-2.6.12-vanilla.series +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6.12-vanilla.series @@ -11,3 +11,5 @@ ext3-ialloc-2.6.patch ext3-remove-cond_resched-calls-2.6.12.patch ext3-htree-dot-2.6.patch ext3-external-journal-2.6.12.patch +ext3-lookup-dotdot-2.6.9.patch +ext3-sector_t-overflow-2.6.12.patch diff --git a/ldiskfs/kernel_patches/series/ldiskfs-2.6.18-vanilla.series b/ldiskfs/kernel_patches/series/ldiskfs-2.6.18-vanilla.series new file mode 100644 index 0000000..f379cec --- /dev/null +++ b/ldiskfs/kernel_patches/series/ldiskfs-2.6.18-vanilla.series @@ -0,0 +1,13 @@ +ext3-wantedi-2.6-rhel4.patch +ext3-san-jdike-2.6-suse.patch +iopen-2.6-fc5.patch +ext3-map_inode_page-2.6-suse.patch +export-ext3-2.6-rhel4.patch +ext3-include-fixes-2.6-rhel4.patch 
+ext3-extents-2.6.18-vanilla.patch +ext3-mballoc2-2.6.18-vanilla.patch +ext3-nlinks-2.6.9.patch +ext3-ialloc-2.6.patch +ext3-remove-cond_resched-calls-2.6.12.patch +ext3-filterdata-2.6.15.patch +ext3-multi-mount-protection-2.6.18-vanilla.patch diff --git a/lustre/ChangeLog b/lustre/ChangeLog index 727f180..6ecae2e 100644 --- a/lustre/ChangeLog +++ b/lustre/ChangeLog @@ -1,5 +1,91 @@ tbd Cluster File Systems, Inc. + * version 1.6.0 + * CONFIGURATION CHANGE. This version of Lustre WILL NOT + INTEROPERATE with older versions automatically. In many cases a + special upgrade step is needed. Please read the + user documentation before upgrading any part of a live system. + * WIRE PROTOCOL CHANGE from previous 1.6 beta versions. This + version will not interoperate with older 1.6 betas. + * WARNING: Lustre configuration and startup changes are required with + this release. See https://mail.clusterfs.com/wikis/lustre/MountConf + for details. + * bug fixes + + +Severity : enhancement +Bugzilla : 4226 +Description: Permanently set tunables +Details : All writable /proc/fs/lustre tunables can now be permanently + set on a per-server basis, at mkfs time or on a live + system. + +Severity : enhancement +Bugzilla : 10547 +Description: Lustre message v2 +Details : Add lustre message format v2. + +Severity : enhancement +Bugzilla : 8007 +Description: MountConf +Details : Lustre configuration is now managed via mkfs and mount + commands instead of lmc and lconf. New obd types (MGS, MGC) + are added for dynamic configuration management. See + https://mail.clusterfs.com/wikis/lustre/MountConf for + details. + +Severity : enhancement +Bugzilla : 4482 +Description: dynamic OST addition +Details : OSTs can now be added to a live filesystem + +Severity : enhancement +Bugzilla : 9851 +Description: startup order invariance +Details : MDTs and OSTs can be started in any order. Clients only + require the MDT to complete startup. 
+ +Severity : enhancement +Bugzilla : 4899 +Description: parallel, asynchronous orphan cleanup +Details : orphan cleanup is now performed in separate threads for each + OST, allowing parallel non-blocking operation. + +Severity : enhancement +Bugzilla : 9862 +Description: optimized stripe assignment +Details : stripe assignments are now made based on ost space available, + ost previous usage, and OSS previous usage, in order to try + to optimize storage space and networking resources. + +Severity : enhancement +Bugzilla : 9866 +Description: client OST exclusion list +Details : Clients can be started with a list of OSTs that should be + declared "inactive" for known non-responsive OSTs. + +Severity : minor +Bugzilla : 6062 +Description: SPEC SFS validation failure on NFS v2 over lustre. +Details : Changes the blocksize for regular files to be 2x RPC size, + and not depend on stripe size. + +Severity : enhancement +Bugzilla : 9293 +Description: Multiple MD RPCs in flight. +Details : Further unserialise some read-only MDS RPCs - learn about intents. + To avoid overly-overloading MDS, introduce a limit on number of + MDS RPCs in flight for a single client and add /proc controls + to adjust this limit. + + +------------------------------------------------------------------------------ + +tbd Cluster File Systems, Inc. * version 1.4.7 + * Support for kernels: + 2.6.9-34.EL (RHEL 4) + 2.6.5-7.252 (SLES 9) + 2.6.12.6 vanilla (kernel.org) * bug fixes Severity : major @@ -45,10 +131,11 @@ Severity : enhancement Bugzilla : 9340 Description: allow number of MDS service threads to be changed at module load Details : It is now possible to change the number of MDS service threads - running. Adding "options mds mds_num_threads=N" will set the - number of threads for the next time Lustre is restarted (assuming - the "mds" module is also reloaded at that time). The default - number of threads will stay the same, 32 for most systems. + running. 
+ Adding "options mds mds_num_threads={N}" to the MDS's + /etc/modprobe.conf will set the number of threads for the next + time Lustre is restarted (assuming the "mds" module is also + reloaded at that time). The default number of threads will + stay the same, 32 for most systems. Severity : major Frequency : rare @@ -109,7 +196,7 @@ Details : When running an obd_echo server it did not start the ping_evictor service startup instead of the OBD startup. Severity : enhancement -Bugzilla : 10393 (patchless) +Bugzilla : 10193 (patchless) Description: Remove dependency on various unexported kernel interfaces. Details : No longer need reparent_to_init, exit_mm, exit_files, sock_getsockopt, filemap_populate, FMODE_EXEC, put_filp. @@ -157,6 +244,7 @@ Details : Use asynchronous set_info RPCs to send the "evict_by_nid" to and also offers similar improvements for other set_info RPCs. Severity : minor +Frequency : common Bugzilla : 10265 Description: excessive CPU usage during initial read phase on client Details : During the initial read phase on a client, it would agressively @@ -167,10 +255,257 @@ Details : During the initial read phase on a client, it would agressively /proc/fs/lustre/llite/*/max_read_ahead_whole_mb, 2MB by default). Severity : minor +Frequency : rare Bugzilla : 10450 Description: MDS crash when receiving packet with unknown intent. Details : Do not LBUG in unknown intent case, just return -EFAULT +Severity : enhancement +Bugzilla : 9293, 9385 +Description: MDS RPCs are serialised on client. This is unnecessary for some. +Details : Do not serialize getattr (non-intent version) and statfs. + +Severity : minor +Frequency : occasional, when OST network is overloaded/intermittent +Bugzilla : 10416 +Description: client evicted by OST after bulk IO timeout +Details : If a client sends a bulk IO request (read or write) the OST + may evict the client if it is unresponsive to its data GET/PUT + request. 
This is incorrect if the network is overloaded (takes + too long to transfer the RPC data) or dropped the OST GET/PUT + request. There is no need to evict the client at all, since + the pinger and/or lock callbacks will handle this, and the + client can restart the bulk request. + +Severity : minor +Frequency : Always when mmapping file with no objects +Bugzilla : 10438 +Description: client crashes when mmapping file with no objects +Details : Check that we actually have objects in a file before doing any + operations on objects in ll_vm_open, ll_vm_close and + ll_glimpse_size. + +Severity : minor +Frequency : Rare +Bugzilla : 10484 +Description: Request leak when working with deleted CWD +Details : Introduce advanced request refcount tracking for requests + referenced from lustre intent. + +Severity : Enhancement +Bugzilla : 10482 +Description: Cache open file handles on client. +Details : MDS will now return a special lock along with the openhandle, if + requested and the client is allowed to hold the openhandle, even + if unused, until such a lock is revoked. Helps NFS a lot, since + NFS is opening and closing files for every read/write operation. + +Severity : Enhancement +Bugzilla : 9291 +Description: Cache open negative dentries on client when possible. +Details : Guard negative dentries with UPDATE lock on parent dir, drop + negative dentries on lock revocation. + +Severity : minor +Frequency : Always +Bugzilla : 10510 +Description: Remounting a client read-only wasn't possible with a zconf mount +Details : It wasn't possible to remount a client read-only with llmount. + +Severity : enhancement +Description: Include MPICH 1.2.6 Lustre ADIO interface patch +Details : In lustre/contrib/ or /usr/share/lustre in RPM a patch for + MPICH is included to add Lustre-specific ADIO interfaces. + This is based closely on the UFS ADIO layer and only differs + in file creation, in order to allow the OST striping to be set. + This is user-contributed code and not supported by CFS. 
+ +Severity : minor +Frequency : Always +Bugzilla : 9486 +Description: extended inode attributes (immutable, append-only) work improperly + when 2.4 and 2.6 kernels are used on client/server or vice versa +Details : Introduce kernel-independent values for these flags. + +Severity : enhancement +Frequency : Always +Bugzilla : 10248 +Description: Allow fractional MB tunings for lustre in /proc/ filesystem. +Details : Many of the /proc/ tunables can only be tuned at a megabyte + granularity. Fractional MB granularity is now supported, + which is very useful for low-memory systems. + +Severity : enhancement +Bugzilla : 9292 +Description: Getattr by fid +Details : Getting a file's attributes by its fid, obtaining UPDATE|LOOKUP + locks, avoids extra getattr rpc requests to the MDS, allows '/' to + have locks and avoids getattr rpc requests for it on every stat. + +Severity : major +Frequency : Always, for filesystems larger than 2TB +Bugzilla : 6191 +Description: ldiskfs crash at mount for filesystem larger than 2TB with mballoc +Details : Kernel kmalloc limits allocations to 128kB and this prevents + filesystems larger than 2TB from being mounted with mballoc enabled. + +Severity : critical +Frequency : Always, for 32-bit kernel without CONFIG_LBD and filesystem > 2TB +Bugzilla : 6191 +Description: ldiskfs crash at mount for filesystem larger than 2TB with mballoc +Details : If a 32-bit kernel is compiled without CONFIG_LBD enabled and a + filesystem larger than 2TB is mounted then the kernel will + silently corrupt the start of the filesystem. CONFIG_LBD is + enabled for all CFS-supported kernels, but the possibility of + this happening with a modified kernel config exists. + +Severity : enhancement +Bugzilla : 10462 +Description: add client O_DIRECT support for 2.6 kernels +Details : It is now possible to do O_DIRECT reads and writes to files + in the Lustre client mountpoint on 2.6 kernel clients. 
+ +Severity : enhancement +Bugzilla : 10446 +Description: parallel glimpse, setattr, statfs, punch, destroy requests +Details : Sends glimpse, setattr, statfs, punch, destroy requests to OSTs in + parallel, not waiting for a response from every OST before sending + an RPC to the next OST. + +Severity : minor +Frequency : rare +Bugzilla : 10150 +Description: setattr vs write race when updating file timestamps +Details : For client processes that update a file timestamp into the past + right after writing to the file (e.g. tar), it is possible that + the updated file modification time can be reset to the current + time due to a race between processing the setattr and write RPC. + +Severity : enhancement +Bugzilla : 10318 +Description: Bring 'lfs find' closer in line with regular Linux find. +Details : lfs find util supports -atime, -mtime, -ctime, -maxdepth, -print, + -print0 options and obtains all the needed info through the lustre + ioctls. + +Severity : enhancement +Bugzilla : 6221 +Description: support up to 1024 configured devices on one node +Details : change obd_dev array from statically allocated to dynamically + allocated structs as they are first used to reduce memory usage + +Severity : minor +Frequency : rare +Bugzilla : 10437 +Description: Flush dirty partially truncated pages during truncate +Details : Immediately flush partially truncated pages in filter_setattr; + this way we completely avoid having any pages in page cache on OST + and can retire ugly workarounds during writes to flush such pages. + +Severity : minor +Frequency : rare +Bugzilla : 10409 +Description: i_sem vs transaction deadlock in mds_obd_destroy during unlink. +Details : protect the inode from truncation within the vfs_unlink() context; + just take a reference before calling vfs_unlink() and release it + when the parent's i_sem is free. 
+ +Severity : major +Frequency : rare +Bugzilla : 4778 +Description: last_id value checked outside lock on OST caused LASSERT failure +Details : If there were multiple MDS->OST object precreate requests in + flight, it was possible that the OST's last object id was checked + outside a lock and incorrectly tripped an assertion. Move checks + inside locks, and discard old precreate requests. + +Severity : minor +Frequency : always, if extents are used on OSTs +Bugzilla : 10703 +Description: index ei_leaf_hi (48-bit extension) is not zeroed in extent index +Details : OSTs using the extents format would not zero the high 16 bits of + the index physical block number. This is not a problem for any + OST filesystems smaller than 16TB, and no kernels support ext3 + filesystems larger than 16TB yet. This is fixed in 1.4.7 (all + new/modified files) and can be fixed for existing filesystems + with e2fsprogs-1.39-cfs1. + +Severity : minor +Frequency : rare +Bugzilla : 9387 +Description: import connection selection may be incorrect if timer wraps +Details : Using a 32-bit jiffies timer with HZ=1000 may cause backup + import connections to be ignored if the 32-bit jiffies counter + wraps. Use a 64-bit jiffies counter. + +Severity : minor +Frequency : very large clusters immediately after boot +Bugzilla : 10083 +Description: LNET request buffers exhausted under heavy short-term load +Details : If a large number of client requests are generated on a service + that has previously never seen so many requests it is possible + that the request buffer growth cannot keep up with the spike in + demand. Instead of dropping incoming requests, they are held in + the LND until the RPC service can accept more requests. + +Severity : minor +Frequency : Sometimes during replay +Bugzilla : 9314 +Description: Assertion failure in ll_local_open after replay. 
+Details : If replay happened on an open request reply before we were able + to set the replay handler, the reply would not be swabbed, + tripping the assertion in ll_local_open. Now we set the handler + right after recognising the open request. + +Severity : minor +Frequency : very rare +Bugzilla : 10669 +Description: Deadlock: extent lock cancellation callback vs import invalidation +Details : If the extent lock cancellation callback takes long enough, and + the import happens to get invalidated in the process, there is a + deadlock on page_lock in extent lock cancellation vs ns_lock in + import invalidation processes. The fix is to not try to match + locks from inactive OSTs. + +Severity : trivial +Frequency : very rare +Bugzilla : 10584 +Description: kernel reports "badness in vsnprintf" +Details : Reading from the "recovery_status" /proc file in small chunks + may cause a negative length in the lprocfs_obd_rd_recovery_status() + call to vsnprintf() (which is otherwise harmless). Exit early + if there is no more space in the output buffer. + +Severity : enhancement +Bugzilla : 2259 +Description: clear OBD RPC statistics by writing to them +Details : It is now possible to clear the OBD RPC statistics by writing + to the "stats" file. + +Severity : minor +Frequency : always +Bugzilla : 10611 +Description: Inability to activate failout mode +Details : The lconf script incorrectly assumed that in python a string's + numeric value is used in comparisons. + +Severity : minor +Frequency : always with multiple stripes per file +Bugzilla : 10671 +Description: Inefficient object allocation for multi-stripe files +Details : When selecting which OSTs to stripe files over, for files with + a stripe count that divides evenly into the number of OSTs, + the MDS was always picking the same starting OST for each file. + Return the OST selection heuristic to the original design. 
+ +Severity : trivial +Frequency : rare +Bugzilla : 10673 +Description: mount failures may take full timeout to return an error +Details : Under some heavy load conditions it is possible that a + failed mount can wait for the full obd_timeout interval, + possibly several minutes, before reporting an error. + Instead return an error as soon as the status is known. ------------------------------------------------------------------------------ @@ -183,9 +518,9 @@ Details : Do not LBUG in unknown intent case, just return -EFAULT this release. See https://bugzilla.clusterfs.com/show_bug.cgi?id=10052 for details. * bug fixes - * Support for newer kernels: - 2.6.9-22.0.2.EL (RHEL 4), - 2.6.5-7.244 (SLES 9) - same as 1.4.5.2. + * Support for kernels: + 2.6.9-22.0.2.EL (RHEL 4) + 2.6.5-7.244 (SLES 9) 2.6.12.6 vanilla (kernel.org) @@ -574,8 +909,8 @@ Severity : enhancement Bugzilla : 4928, 7341, 9758 Description: allow number of OST service threads to be specified Details : a module parameter allows the number of OST service threads - to be specified via "options ost ost_num_threads=X" in - /etc/modules.conf or /etc/modutils.conf. + to be specified via "options ost ost_num_threads={N}" in the + OSS's /etc/modules.conf or /etc/modprobe.conf. Severity : major Frequency : rare @@ -700,6 +1035,14 @@ Details : If a client is repeatedly creating and unlinking files it client node to run out of memory. Instead flush old inodes from client cache that have the same inode number as a new inode. +Severity : minor +Frequency : SLES9 2.6.5 kernel and long filenames only +Bugzilla : 9969, 10379 +Description: utime reports stale NFS file handle +Details : SLES9 uses out-of-dentry names in some cases, which confused + the lustre dentry revalidation. Change it to always use the + in-dentry qstr. 
+ Severity : major Frequency : rare, unless heavy write-truncate concurrency is continuous Bugzilla : 4180, 6984, 7171, 9963, 9331 diff --git a/lustre/autoMakefile.am b/lustre/autoMakefile.am index c9e56a5..3b06838 100644 --- a/lustre/autoMakefile.am +++ b/lustre/autoMakefile.am @@ -5,12 +5,13 @@ AUTOMAKE_OPTIONS = foreign -ALWAYS_SUBDIRS := include fid lvfs obdclass ldlm ptlrpc osc lov obdecho \ - mgc doc utils tests conf scripts autoconf +# also update lustre/autoconf/lustre-core.m4 AC_CONFIG_FILES +ALWAYS_SUBDIRS := include lvfs obdclass ldlm ptlrpc osc lov obdecho \ + mgc fid fld doc utils tests scripts autoconf contrib SERVER_SUBDIRS := ldiskfs obdfilter ost mds mgs mdt cmm mdd osd -CLIENT_SUBDIRS := mdc lmv llite fld +CLIENT_SUBDIRS := mdc lmv llite QUOTA_SUBDIRS := quota @@ -58,27 +59,11 @@ sources: $(LDISKFS) lvfs-sources obdclass-sources lustre_build_version all-recursive: lustre_build_version +BUILD_VER_H=$(top_builddir)/lustre/include/linux/lustre_build_version.h + lustre_build_version: perl $(top_builddir)/lustre/scripts/version_tag.pl $(top_srcdir) $(top_builddir) > tmpver echo "#define LUSTRE_RELEASE @RELEASE@" >> tmpver - cmp -s $(top_builddir)/lustre/include/linux/lustre_build_version.h tmpver \ - 2> /dev/null && \ - $(RM) tmpver || \ - mv tmpver $(top_builddir)/lustre/include/linux/lustre_build_version.h - -CSTK=/tmp/checkstack -CSTKO=/tmp/checkstack.orig - -checkstack: - [ -f ${CSTK} -a ! -s ${CSTKO} ] && mv ${CSTK} ${CSTKO} || true - for i in ${SUBDIRS} lnet/klnds/*; do \ - MOD=$$i/`basename $$i`.o; \ - [ -f $$MOD ] && objdump -d $$MOD | perl tests/checkstack.pl; \ - done | sort -nr > ${CSTK} - [ -f ${CSTKO} ] && ! 
diff -u ${CSTKO} ${CSTK} || head -30 ${CSTK} - -checkstack-update: - [ -f ${CSTK} ] && mv ${CSTK} ${CSTKO} - -checkstack-clean: - rm -f ${CSTK} ${CSTKO} + cmp -s $(BUILD_VER_H) tmpver > tmpdiff 2> /dev/null && \ + $(RM) tmpver tmpdiff || \ + mv -f tmpver $(BUILD_VER_H) diff --git a/lustre/autoconf/lustre-core.m4 b/lustre/autoconf/lustre-core.m4 index 47a58f5..a7e1fc3 100644 --- a/lustre/autoconf/lustre-core.m4 +++ b/lustre/autoconf/lustre-core.m4 @@ -26,9 +26,6 @@ AC_SUBST(demodir) pkgexampledir='${pkgdatadir}/examples' AC_SUBST(pkgexampledir) - -pymoddir='${pkglibdir}/python/Lustre' -AC_SUBST(pymoddir) ]) # @@ -363,8 +360,12 @@ case $BACKINGFS in case $LINUXRELEASE in 2.6.5*) LDISKFS_SERIES="2.6-suse.series" ;; 2.6.9*) LDISKFS_SERIES="2.6-rhel4.series" ;; + 2.6.10-ac*) LDISKFS_SERIES="2.6-fc3.series" ;; 2.6.10*) LDISKFS_SERIES="2.6-rhel4.series" ;; 2.6.12*) LDISKFS_SERIES="2.6.12-vanilla.series" ;; + 2.6.15*) LDISKFS_SERIES="2.6-fc5.series";; + 2.6.16*) LDISKFS_SERIES="2.6-fc5.series";; + 2.6.18*) LDISKFS_SERIES="2.6.18-vanilla.series";; *) AC_MSG_WARN([Unknown kernel version $LINUXRELEASE, fix lustre/autoconf/lustre-core.m4]) esac AC_MSG_RESULT([$LDISKFS_SERIES]) @@ -487,13 +488,111 @@ LB_LINUX_TRY_COMPILE([ ]) ]) +AC_DEFUN([LC_BIT_SPINLOCK_H], +[LB_CHECK_FILE([$LINUX/include/linux/bit_spinlock.h],[ + AC_MSG_CHECKING([if bit_spinlock.h can be compiled]) + LB_LINUX_TRY_COMPILE([ + #include + #include + #include + ],[],[ + AC_MSG_RESULT([yes]) + AC_DEFINE(HAVE_BIT_SPINLOCK_H, 1, [Kernel has bit_spinlock.h]) + ],[ + AC_MSG_RESULT([no]) + ]) +], +[]) +]) + +# +# LC_POSIX_ACL_XATTR +# +# If we have xattr_acl.h +# +AC_DEFUN([LC_XATTR_ACL], +[LB_CHECK_FILE([$LINUX/include/linux/xattr_acl.h],[ + AC_MSG_CHECKING([if xattr_acl.h can be compiled]) + LB_LINUX_TRY_COMPILE([ + #include + ],[],[ + AC_MSG_RESULT([yes]) + AC_DEFINE(HAVE_XATTR_ACL, 1, [Kernel has xattr_acl]) + ],[ + AC_MSG_RESULT([no]) + ]) +], +[]) +]) + +AC_DEFUN([LC_STRUCT_INTENT_FILE], +[AC_MSG_CHECKING([if 
struct open_intent has a file field]) +LB_LINUX_TRY_COMPILE([ + #include + #include +],[ + struct open_intent intent; + &intent.file; +],[ + AC_MSG_RESULT([yes]) + AC_DEFINE(HAVE_FILE_IN_STRUCT_INTENT, 1, [struct open_intent has a file field]) +],[ + AC_MSG_RESULT([no]) +]) +]) + + +AC_DEFUN([LC_POSIX_ACL_XATTR_H], +[LB_CHECK_FILE([$LINUX/include/linux/posix_acl_xattr.h],[ + AC_MSG_CHECKING([if linux/posix_acl_xattr.h can be compiled]) + LB_LINUX_TRY_COMPILE([ + #include + ],[],[ + AC_MSG_RESULT([yes]) + AC_DEFINE(HAVE_LINUX_POSIX_ACL_XATTR_H, 1, [linux/posix_acl_xattr.h found]) + + ],[ + AC_MSG_RESULT([no]) + ]) +$1 +],[ +AC_MSG_RESULT([no]) +]) +]) + +AC_DEFUN([LC_LUSTRE_VERSION_H], +[LB_CHECK_FILE([$LINUX/include/linux/lustre_version.h],[ + rm -f "$LUSTRE/include/linux/lustre_version.h" +],[ + touch "$LUSTRE/include/linux/lustre_version.h" + if test x$enable_server = xyes ; then + AC_MSG_WARN([Patchless build detected, disabling server building]) + enable_server='no' + fi +]) +]) + +AC_DEFUN([LC_FUNC_SET_FS_PWD], +[AC_MSG_CHECKING([if kernel exports show_task]) +have_show_task=0 + if grep -q "EXPORT_SYMBOL(show_task)" \ + "$LINUX/fs/namespace.c" 2>/dev/null ; then + AC_DEFINE(HAVE_SET_FS_PWD, 1, [set_fs_pwd is exported]) + AC_MSG_RESULT([yes]) + else + AC_MSG_RESULT([no]) + fi +]) + + # # LC_PROG_LINUX # # Lustre linux kernel checks # AC_DEFUN([LC_PROG_LINUX], -[if test x$enable_server = xyes ; then +[ LC_LUSTRE_VERSION_H +if test x$enable_server = xyes ; then LC_CONFIG_BACKINGFS fi LC_CONFIG_PINGER @@ -515,6 +614,11 @@ LC_FUNC_PAGE_MAPPED LC_STRUCT_FILE_OPS_UNLOCKED_IOCTL LC_FILEMAP_POPULATE LC_D_ADD_UNIQUE +LC_BIT_SPINLOCK_H +LC_XATTR_ACL +LC_STRUCT_INTENT_FILE +LC_POSIX_ACL_XATTR_H +LC_FUNC_SET_FS_PWD ]) # @@ -602,7 +706,7 @@ AC_DEFUN([LC_CONFIGURE], [LC_CONFIG_OBD_BUFFER_SIZE # include/liblustre.h -AC_CHECK_HEADERS([asm/page.h sys/user.h sys/vfs.h stdint.h]) +AC_CHECK_HEADERS([asm/page.h sys/user.h sys/vfs.h stdint.h blkid/blkid.h]) # 
include/lustre/lustre_user.h # See note there re: __ASM_X86_64_PROCESSOR_H @@ -618,12 +722,8 @@ AC_CHECK_HEADERS([linux/types.h sys/types.h linux/unistd.h unistd.h]) AC_CHECK_HEADERS([netinet/in.h arpa/inet.h catamount/data.h]) AC_CHECK_FUNCS([inet_ntoa]) -# llite/xattr.c -AC_CHECK_HEADERS([linux/xattr_acl.h]) - -# use universal lustre headers -# i.e: include/obd.h instead of include/linux/obd.h -AC_CHECK_FILE($PWD/lustre/include/obd.h, [AC_DEFINE(UNIV_LUSTRE_HEADERS, 1, [Use universal lustre headers])]) +# utils/llverfs.c +AC_CHECK_HEADERS([ext2fs/ext2fs.h]) # Super safe df AC_ARG_ENABLE([mindf], @@ -650,6 +750,8 @@ AM_CONDITIONAL(MPITESTS, test x$enable_mpitests = xyes, Build MPI Tests) AM_CONDITIONAL(CLIENT, test x$enable_client = xyes) AM_CONDITIONAL(SERVER, test x$enable_server = xyes) AM_CONDITIONAL(QUOTA, test x$enable_quota = xyes) +AM_CONDITIONAL(BLKID, test x$ac_cv_header_blkid_blkid_h = xyes) +AM_CONDITIONAL(EXT2FS_DEVEL, test x$ac_cv_header_ext2fs_ext2fs_h = xyes) ]) # @@ -662,7 +764,7 @@ AC_DEFUN([LC_CONFIG_FILES], lustre/Makefile lustre/autoMakefile lustre/autoconf/Makefile -lustre/conf/Makefile +lustre/contrib/Makefile lustre/doc/Makefile lustre/include/Makefile lustre/include/lustre_ver.h @@ -671,6 +773,8 @@ lustre/include/lustre/Makefile lustre/kernel_patches/targets/2.6-suse.target lustre/kernel_patches/targets/2.6-vanilla.target lustre/kernel_patches/targets/2.6-rhel4.target +lustre/kernel_patches/targets/2.6-fc5.target +lustre/kernel_patches/targets/2.6-patchless.target lustre/kernel_patches/targets/hp_pnnl-2.4.target lustre/kernel_patches/targets/rh-2.4.target lustre/kernel_patches/targets/rhel-2.4.target @@ -727,7 +831,6 @@ lustre/quota/autoMakefile lustre/scripts/Makefile lustre/scripts/version_tag.pl lustre/tests/Makefile -lustre/utils/Lustre/Makefile lustre/utils/Makefile ]) case $lb_target_os in diff --git a/lustre/contrib/.cvsignore b/lustre/contrib/.cvsignore new file mode 100644 index 0000000..282522d --- /dev/null +++ 
b/lustre/contrib/.cvsignore @@ -0,0 +1,2 @@ +Makefile +Makefile.in diff --git a/lustre/contrib/Makefile.am b/lustre/contrib/Makefile.am new file mode 100644 index 0000000..5a8e66c --- /dev/null +++ b/lustre/contrib/Makefile.am @@ -0,0 +1,5 @@ +# Contributions Makefile + +EXTRA_DIST = mpich-*.patch +pkgdata_DATA = $(EXTRA_DIST) + diff --git a/lustre/contrib/README b/lustre/contrib/README new file mode 100644 index 0000000..73270f3 --- /dev/null +++ b/lustre/contrib/README @@ -0,0 +1,2 @@ +The files in this directory are user-contributed and are not supported by +CFS in any way. diff --git a/lustre/contrib/mpich-1.2.6-lustre.patch b/lustre/contrib/mpich-1.2.6-lustre.patch new file mode 100644 index 0000000..d32fab9 --- /dev/null +++ b/lustre/contrib/mpich-1.2.6-lustre.patch @@ -0,0 +1,1829 @@ +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.c 2005-12-06 11:54:37.883130927 -0500 +@@ -0,0 +1,37 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 2001 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++/* adioi.h has the ADIOI_Fns_struct define */ ++#include "adioi.h" ++ ++struct ADIOI_Fns_struct ADIO_LUSTRE_operations = { ++ ADIOI_LUSTRE_Open, /* Open */ ++ ADIOI_LUSTRE_ReadContig, /* ReadContig */ ++ ADIOI_LUSTRE_WriteContig, /* WriteContig */ ++ ADIOI_GEN_ReadStridedColl, /* ReadStridedColl */ ++ ADIOI_GEN_WriteStridedColl, /* WriteStridedColl */ ++ ADIOI_GEN_SeekIndividual, /* SeekIndividual */ ++ ADIOI_LUSTRE_Fcntl, /* Fcntl */ ++ ADIOI_LUSTRE_SetInfo, /* SetInfo */ ++ ADIOI_GEN_ReadStrided, /* ReadStrided */ ++ ADIOI_GEN_WriteStrided, /* WriteStrided */ ++ ADIOI_LUSTRE_Close, /* Close */ ++ ADIOI_LUSTRE_IreadContig, /* IreadContig */ ++ ADIOI_LUSTRE_IwriteContig, /* IwriteContig */ ++ ADIOI_LUSTRE_ReadDone, /* ReadDone */ ++ ADIOI_LUSTRE_WriteDone, /* WriteDone */ ++ ADIOI_LUSTRE_ReadComplete, /* ReadComplete */ ++ ADIOI_LUSTRE_WriteComplete, /* WriteComplete */ ++ ADIOI_LUSTRE_IreadStrided, /* IreadStrided */ ++ ADIOI_LUSTRE_IwriteStrided, /* IwriteStrided */ ++ ADIOI_GEN_Flush, /* Flush */ ++ ADIOI_LUSTRE_Resize, /* Resize */ ++ ADIOI_GEN_Delete, /* Delete */ ++}; +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_close.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_close.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_close.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_close.c 2005-12-06 11:54:37.895129327 -0500 +@@ -0,0 +1,32 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_close.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_Close(ADIO_File fd, int *error_code) ++{ ++ int err; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_CLOSE"; ++#endif ++ ++ err = close(fd->fd_sys); ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_done.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_done.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_done.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_done.c 2005-12-06 11:54:37.898128927 -0500 +@@ -0,0 +1,188 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_done.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++int ADIOI_LUSTRE_ReadDone(ADIO_Request *request, ADIO_Status *status, int *error_code) ++{ ++#ifndef NO_AIO ++ int done=0; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_READDONE"; ++#endif ++#ifdef AIO_SUN ++ aio_result_t *result=0, *tmp; ++#else ++ int err; ++#endif ++#ifdef AIO_HANDLE_IN_AIOCB ++ struct aiocb *tmp1; ++#endif ++#endif ++ ++ if (*request == ADIO_REQUEST_NULL) { ++ *error_code = MPI_SUCCESS; ++ return 1; ++ } ++ ++#ifdef NO_AIO ++/* HP, FreeBSD, Linux */ ++#ifdef HAVE_STATUS_SET_BYTES ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ (*request)->fd->async_count--; ++ ADIOI_Free_request((ADIOI_Req_node *) (*request)); ++ *request = ADIO_REQUEST_NULL; ++ *error_code = MPI_SUCCESS; ++ return 1; ++#endif ++ ++#ifdef AIO_SUN ++ if ((*request)->queued) { ++ tmp = (aio_result_t *) (*request)->handle; ++ if (tmp->aio_return == AIO_INPROGRESS) { ++ done = 0; ++ *error_code = MPI_SUCCESS; ++ } ++ else if (tmp->aio_return != -1) { ++ result = (aio_result_t *) aiowait(0); /* dequeue any one request */ ++ done = 1; ++ (*request)->nbytes = tmp->aio_return; ++ *error_code = MPI_SUCCESS; ++ } ++ else { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(tmp->aio_errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(tmp->aio_errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ } /* if ((*request)->queued) ... */ ++ else { ++ /* ADIOI_Complete_Async completed this request, but request object ++ was not freed. 
*/ ++ done = 1; ++ *error_code = MPI_SUCCESS; ++ } ++#ifdef HAVE_STATUS_SET_BYTES ++ if (done && ((*request)->nbytes != -1)) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ ++#endif ++ ++#ifdef AIO_HANDLE_IN_AIOCB ++/* IBM */ ++ if ((*request)->queued) { ++ tmp1 = (struct aiocb *) (*request)->handle; ++ errno = aio_error(tmp1->aio_handle); ++ if (errno == EINPROG) { ++ done = 0; ++ *error_code = MPI_SUCCESS; ++ } ++ else { ++ err = aio_return(tmp1->aio_handle); ++ (*request)->nbytes = err; ++ errno = aio_error(tmp1->aio_handle); ++ ++ done = 1; ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ } ++ } /* if ((*request)->queued) */ ++ else { ++ done = 1; ++ *error_code = MPI_SUCCESS; ++ } ++#ifdef HAVE_STATUS_SET_BYTES ++ if (done && ((*request)->nbytes != -1)) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ ++#elif (!defined(NO_AIO) && !defined(AIO_SUN)) ++/* DEC, SGI IRIX 5 and 6 */ ++ if ((*request)->queued) { ++ errno = aio_error((const struct aiocb *) (*request)->handle); ++ if (errno == EINPROGRESS) { ++ done = 0; ++ *error_code = MPI_SUCCESS; ++ } ++ else { ++ err = aio_return((struct aiocb *) (*request)->handle); ++ (*request)->nbytes = err; ++ errno = aio_error((struct aiocb *) (*request)->handle); ++ ++ done = 1; ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = 
MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ } ++ } /* if ((*request)->queued) */ ++ else { ++ done = 1; ++ *error_code = MPI_SUCCESS; ++ } ++#ifdef HAVE_STATUS_SET_BYTES ++ if (done && ((*request)->nbytes != -1)) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ ++#endif ++ ++#ifndef NO_AIO ++ if (done) { ++ /* if request is still queued in the system, it is also there ++ on ADIOI_Async_list. Delete it from there. */ ++ if ((*request)->queued) ADIOI_Del_req_from_list(request); ++ ++ (*request)->fd->async_count--; ++ if ((*request)->handle) ADIOI_Free((*request)->handle); ++ ADIOI_Free_request((ADIOI_Req_node *) (*request)); ++ *request = ADIO_REQUEST_NULL; ++ } ++ return done; ++#endif ++ ++} ++ ++ ++int ADIOI_LUSTRE_WriteDone(ADIO_Request *request, ADIO_Status *status, int *error_code) ++{ ++ return ADIOI_LUSTRE_ReadDone(request, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_fcntl.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_fcntl.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_fcntl.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_fcntl.c 2005-12-06 11:54:37.901128527 -0500 +@@ -0,0 +1,126 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_fcntl.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++#include "adio_extern.h" ++/* #ifdef MPISGI ++#include "mpisgi2.h" ++#endif */ ++ ++void ADIOI_LUSTRE_Fcntl(ADIO_File fd, int flag, ADIO_Fcntl_t *fcntl_struct, int *error_code) ++{ ++ int i, ntimes; ++ ADIO_Offset curr_fsize, alloc_size, size, len, done; ++ ADIO_Status status; ++ char *buf; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_FCNTL"; ++#endif ++ ++ switch(flag) { ++ case ADIO_FCNTL_GET_FSIZE: ++ fcntl_struct->fsize = lseek(fd->fd_sys, 0, SEEK_END); ++ if (fd->fp_sys_posn != -1) ++ lseek(fd->fd_sys, fd->fp_sys_posn, SEEK_SET); ++ if (fcntl_struct->fsize == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ break; ++ ++ case ADIO_FCNTL_SET_DISKSPACE: ++ /* will be called by one process only */ ++ /* On file systems with no preallocation function, I have to ++ explicitly write ++ to allocate space. Since there could be holes in the file, ++ I need to read up to the current file size, write it back, ++ and then write beyond that depending on how much ++ preallocation is needed. 
++ read/write in sizes of no more than ADIOI_PREALLOC_BUFSZ */ ++ ++ curr_fsize = lseek(fd->fd_sys, 0, SEEK_END); ++ alloc_size = fcntl_struct->diskspace; ++ ++ size = ADIOI_MIN(curr_fsize, alloc_size); ++ ++ ntimes = (size + ADIOI_PREALLOC_BUFSZ - 1)/ADIOI_PREALLOC_BUFSZ; ++ buf = (char *) ADIOI_Malloc(ADIOI_PREALLOC_BUFSZ); ++ done = 0; ++ ++ for (i=0; i curr_fsize) { ++ memset(buf, 0, ADIOI_PREALLOC_BUFSZ); ++ size = alloc_size - curr_fsize; ++ ntimes = (size + ADIOI_PREALLOC_BUFSZ - 1)/ADIOI_PREALLOC_BUFSZ; ++ for (i=0; ifp_sys_posn != -1) ++ lseek(fd->fd_sys, fd->fp_sys_posn, SEEK_SET); ++ *error_code = MPI_SUCCESS; ++ break; ++ ++ case ADIO_FCNTL_SET_IOMODE: ++ /* for implementing PFS I/O modes. will not occur in MPI-IO ++ implementation.*/ ++ if (fd->iomode != fcntl_struct->iomode) { ++ fd->iomode = fcntl_struct->iomode; ++ MPI_Barrier(MPI_COMM_WORLD); ++ } ++ *error_code = MPI_SUCCESS; ++ break; ++ ++ case ADIO_FCNTL_SET_ATOMICITY: ++ fd->atomicity = (fcntl_struct->atomicity == 0) ? 0 : 1; ++ *error_code = MPI_SUCCESS; ++ break; ++ ++ default: ++ FPRINTF(stderr, "Unknown flag passed to ADIOI_LUSTRE_Fcntl\n"); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_flush.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_flush.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_flush.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_flush.c 2005-12-06 11:54:37.903128261 -0500 +@@ -0,0 +1,14 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_flush.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_Flush(ADIO_File fd, int *error_code) ++{ ++ ADIOI_GEN_Flush(fd, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.h mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.h +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.h 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre.h 2005-12-06 11:54:37.891129861 -0500 +@@ -0,0 +1,36 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre.h,v 1.2 2005/07/07 14:38:17 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#ifndef AD_UNIX_INCLUDE ++#define AD_UNIX_INCLUDE ++ ++/* temp*/ ++#define HAVE_ASM_TYPES_H 1 ++ ++#include ++#include ++#include ++#include ++#include "lustre/lustre_user.h" ++#include "adio.h" ++ ++#ifndef NO_AIO ++#ifdef AIO_SUN ++#include ++#else ++#include ++#ifdef NEEDS_ADIOCB_T ++typedef struct adiocb adiocb_t; ++#endif ++#endif ++#endif ++ ++int ADIOI_LUSTRE_aio(ADIO_File fd, void *buf, int len, ADIO_Offset offset, ++ int wr, void *handle); ++ ++#endif +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_hints.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_hints.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_hints.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_hints.c 2005-12-06 11:54:37.904128127 -0500 +@@ -0,0 +1,130 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_hints.c,v 1.2 2005/07/07 14:38:17 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_SetInfo(ADIO_File fd, MPI_Info users_info, int *error_code) ++{ ++ char *value, *value_in_fd; ++ int flag, tmp_val, str_factor=-1, str_unit=0, start_iodev=-1; ++ struct lov_user_md lum = { 0 }; ++ int err, myrank, fd_sys, perm, amode, old_mask; ++ ++ if ( (fd->info) == MPI_INFO_NULL) { ++ /* This must be part of the open call. can set striping parameters ++ if necessary. */ ++ MPI_Info_create(&(fd->info)); ++ ++ /* has user specified striping or server buffering parameters ++ and do they have the same value on all processes? */ ++ if (users_info != MPI_INFO_NULL) { ++ value = (char *) ADIOI_Malloc((MPI_MAX_INFO_VAL+1)*sizeof(char)); ++ ++ MPI_Info_get(users_info, "striping_factor", MPI_MAX_INFO_VAL, ++ value, &flag); ++ if (flag) { ++ str_factor=atoi(value); ++ tmp_val = str_factor; ++ MPI_Bcast(&tmp_val, 1, MPI_INT, 0, fd->comm); ++ if (tmp_val != str_factor) { ++ FPRINTF(stderr, "ADIOI_LUSTRE_SetInfo: the value for key \"striping_factor\" must be the same on all processes\n"); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ MPI_Info_get(users_info, "striping_unit", MPI_MAX_INFO_VAL, ++ value, &flag); ++ if (flag) { ++ str_unit=atoi(value); ++ tmp_val = str_unit; ++ MPI_Bcast(&tmp_val, 1, MPI_INT, 0, fd->comm); ++ if (tmp_val != str_unit) { ++ FPRINTF(stderr, "ADIOI_LUSTRE_SetInfo: the value for key \"striping_unit\" must be the same on all processes\n"); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ MPI_Info_get(users_info, "start_iodevice", MPI_MAX_INFO_VAL, ++ value, &flag); ++ if (flag) { ++ start_iodev=atoi(value); ++ tmp_val = start_iodev; ++ MPI_Bcast(&tmp_val, 1, MPI_INT, 0, fd->comm); ++ if (tmp_val != start_iodev) { ++ FPRINTF(stderr, "ADIOI_LUSTRE_SetInfo: the value for key \"start_iodevice\" must be the same on all processes\n"); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ /* if user has specified striping info, process 0 tries to set it */ ++ if ((str_factor > 0) || (str_unit > 0) || 
(start_iodev >= 0)) { ++ MPI_Comm_rank(fd->comm, &myrank); ++ if (!myrank) { ++ if (fd->perm == ADIO_PERM_NULL) { ++ old_mask = umask(022); ++ umask(old_mask); ++ perm = old_mask ^ 0666; ++ } ++ else perm = fd->perm; ++ ++ amode = 0; ++ if (fd->access_mode & ADIO_CREATE) ++ amode = amode | O_CREAT; ++ if (fd->access_mode & ADIO_RDWR || ++ (fd->access_mode & ADIO_RDONLY && ++ fd->access_mode & ADIO_WRONLY)) ++ amode = amode | O_RDWR; ++ else if (fd->access_mode & ADIO_WRONLY) ++ amode = amode | O_WRONLY; ++ else if (fd->access_mode & ADIO_RDONLY) ++ amode = amode | O_RDONLY; ++ if (fd->access_mode & ADIO_EXCL) ++ amode = amode | O_EXCL; ++ ++ /* we need to create file so ensure this is set */ ++ amode = amode | O_LOV_DELAY_CREATE | O_CREAT; ++ ++ fd_sys = open(fd->filename, amode, perm); ++ if (fd_sys == -1) { ++ if (errno != EEXIST) ++ FPRINTF(stderr, "ADIOI_LUSTRE_SetInfo: Failure to open file %s %d %d\n",strerror(errno), amode, perm); ++ } else { ++ lum.lmm_magic = LOV_USER_MAGIC; ++ lum.lmm_pattern = 0; ++ lum.lmm_stripe_size = str_unit; ++ lum.lmm_stripe_count = str_factor; ++ lum.lmm_stripe_offset = start_iodev; ++ ++ err = ioctl(fd_sys, LL_IOC_LOV_SETSTRIPE, &lum); ++ if (err == -1 && errno != EEXIST) { ++ FPRINTF(stderr, "ADIOI_LUSTRE_SetInfo: Failure to set stripe info %s \n",strerror(errno)); ++ } ++ ++ close(fd_sys); ++ } ++ ++ } ++ MPI_Barrier(fd->comm); ++ } ++ ++ ADIOI_Free(value); ++ } ++ ++ /* set the values for collective I/O and data sieving parameters */ ++ ADIOI_GEN_SetInfo(fd, users_info, error_code); ++ } ++ ++ else { ++ /* The file has been opened previously and fd->fd_sys is a valid ++ file descriptor. cannot set striping parameters now. 
*/ ++ ++ /* set the values for collective I/O and data sieving parameters */ ++ ADIOI_GEN_SetInfo(fd, users_info, error_code); ++ ++ } ++ ++ *error_code = MPI_SUCCESS; ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iread.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iread.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iread.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iread.c 2005-12-06 11:54:37.904128127 -0500 +@@ -0,0 +1,106 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_iread.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_IreadContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Request *request, int *error_code) ++{ ++ int len, typesize; ++#ifdef NO_AIO ++ ADIO_Status status; ++#else ++ int err=-1; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_IREADCONTIG"; ++#endif ++#endif ++ ++ (*request) = ADIOI_Malloc_request(); ++ (*request)->optype = ADIOI_READ; ++ (*request)->fd = fd; ++ (*request)->datatype = datatype; ++ ++ MPI_Type_size(datatype, &typesize); ++ len = count * typesize; ++ ++#ifdef NO_AIO ++ /* HP, FreeBSD, Linux */ ++ /* no support for nonblocking I/O. Use blocking I/O. 
*/ ++ ++ ADIOI_LUSTRE_ReadContig(fd, buf, len, MPI_BYTE, file_ptr_type, offset, ++ &status, error_code); ++ (*request)->queued = 0; ++#ifdef HAVE_STATUS_SET_BYTES ++ if (*error_code == MPI_SUCCESS) { ++ MPI_Get_elements(&status, MPI_BYTE, &len); ++ (*request)->nbytes = len; ++ } ++#endif ++ ++#else ++ if (file_ptr_type == ADIO_INDIVIDUAL) offset = fd->fp_ind; ++ err = ADIOI_LUSTRE_aio(fd, buf, len, offset, 0, &((*request)->handle)); ++ if (file_ptr_type == ADIO_INDIVIDUAL) fd->fp_ind += len; ++ ++ (*request)->queued = 1; ++ ADIOI_Add_req_to_list(request); ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++#endif /* NO_AIO */ ++ ++ fd->fp_sys_posn = -1; /* set it to null. */ ++ fd->async_count++; ++} ++ ++ ++ ++void ADIOI_LUSTRE_IreadStrided(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Request *request, int ++ *error_code) ++{ ++ ADIO_Status status; ++#ifdef HAVE_STATUS_SET_BYTES ++ int typesize; ++#endif ++ ++ *request = ADIOI_Malloc_request(); ++ (*request)->optype = ADIOI_READ; ++ (*request)->fd = fd; ++ (*request)->datatype = datatype; ++ (*request)->queued = 0; ++ (*request)->handle = 0; ++ ++/* call the blocking version. It is faster because it does data sieving. 
*/ ++ ADIOI_LUSTRE_ReadStrided(fd, buf, count, datatype, file_ptr_type, ++ offset, &status, error_code); ++ ++ fd->async_count++; ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if (*error_code == MPI_SUCCESS) { ++ MPI_Type_size(datatype, &typesize); ++ (*request)->nbytes = count * typesize; ++ } ++#endif ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iwrite.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iwrite.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iwrite.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_iwrite.c 2005-12-06 11:54:37.906127861 -0500 +@@ -0,0 +1,268 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_iwrite.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_IwriteContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Request *request, int *error_code) ++{ ++ int len, typesize; ++#ifdef NO_AIO ++ ADIO_Status status; ++#else ++ int err=-1; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_IWRITECONTIG"; ++#endif ++#endif ++ ++ *request = ADIOI_Malloc_request(); ++ (*request)->optype = ADIOI_WRITE; ++ (*request)->fd = fd; ++ (*request)->datatype = datatype; ++ ++ MPI_Type_size(datatype, &typesize); ++ len = count * typesize; ++ ++#ifdef NO_AIO ++ /* HP, FreeBSD, Linux */ ++ /* no support for nonblocking I/O. Use blocking I/O. 
*/ ++ ++ ADIOI_LUSTRE_WriteContig(fd, buf, len, MPI_BYTE, file_ptr_type, offset, ++ &status, error_code); ++ (*request)->queued = 0; ++#ifdef HAVE_STATUS_SET_BYTES ++ if (*error_code == MPI_SUCCESS) { ++ MPI_Get_elements(&status, MPI_BYTE, &len); ++ (*request)->nbytes = len; ++ } ++#endif ++ ++#else ++ if (file_ptr_type == ADIO_INDIVIDUAL) offset = fd->fp_ind; ++ err = ADIOI_LUSTRE_aio(fd, buf, len, offset, 1, &((*request)->handle)); ++ if (file_ptr_type == ADIO_INDIVIDUAL) fd->fp_ind += len; ++ ++ (*request)->queued = 1; ++ ADIOI_Add_req_to_list(request); ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++#endif /* NO_AIO */ ++ ++ fd->fp_sys_posn = -1; /* set it to null. */ ++ fd->async_count++; ++} ++ ++ ++ ++ ++void ADIOI_LUSTRE_IwriteStrided(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Request *request, int ++ *error_code) ++{ ++ ADIO_Status status; ++#ifdef HAVE_STATUS_SET_BYTES ++ int typesize; ++#endif ++ ++ *request = ADIOI_Malloc_request(); ++ (*request)->optype = ADIOI_WRITE; ++ (*request)->fd = fd; ++ (*request)->datatype = datatype; ++ (*request)->queued = 0; ++ (*request)->handle = 0; ++ ++/* call the blocking version. It is faster because it does data sieving. 
*/ ++ ADIOI_LUSTRE_WriteStrided(fd, buf, count, datatype, file_ptr_type, ++ offset, &status, error_code); ++ ++ fd->async_count++; ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if (*error_code == MPI_SUCCESS) { ++ MPI_Type_size(datatype, &typesize); ++ (*request)->nbytes = count * typesize; ++ } ++#endif ++} ++ ++ ++/* This function is for implementation convenience. It is not user-visible. ++ It takes care of the differences in the interface for nonblocking I/O ++ on various Unix machines! If wr==1 write, wr==0 read. */ ++ ++int ADIOI_LUSTRE_aio(ADIO_File fd, void *buf, int len, ADIO_Offset offset, ++ int wr, void *handle) ++{ ++ int err=-1, fd_sys; ++ ++#ifndef NO_AIO ++ int error_code; ++#ifdef AIO_SUN ++ aio_result_t *result; ++#else ++ struct aiocb *aiocbp; ++#endif ++#endif ++ ++ fd_sys = fd->fd_sys; ++ ++#ifdef AIO_SUN ++ result = (aio_result_t *) ADIOI_Malloc(sizeof(aio_result_t)); ++ result->aio_return = AIO_INPROGRESS; ++ if (wr) err = aiowrite(fd_sys, buf, len, offset, SEEK_SET, result); ++ else err = aioread(fd_sys, buf, len, offset, SEEK_SET, result); ++ ++ if (err == -1) { ++ if (errno == EAGAIN) { ++ /* the man pages say EPROCLIM, but in reality errno is set to EAGAIN! */ ++ ++ /* exceeded the max. no. of outstanding requests. ++ complete all previous async. 
requests and try again.*/ ++ ++ ADIOI_Complete_async(&error_code); ++ if (wr) err = aiowrite(fd_sys, buf, len, offset, SEEK_SET, result); ++ else err = aioread(fd_sys, buf, len, offset, SEEK_SET, result); ++ ++ while (err == -1) { ++ if (errno == EAGAIN) { ++ /* sleep and try again */ ++ sleep(1); ++ if (wr) err = aiowrite(fd_sys, buf, len, offset, SEEK_SET, result); ++ else err = aioread(fd_sys, buf, len, offset, SEEK_SET, result); ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ *((aio_result_t **) handle) = result; ++#endif ++ ++#ifdef NO_FD_IN_AIOCB ++/* IBM */ ++ aiocbp = (struct aiocb *) ADIOI_Malloc(sizeof(struct aiocb)); ++ aiocbp->aio_whence = SEEK_SET; ++ aiocbp->aio_offset = offset; ++ aiocbp->aio_buf = buf; ++ aiocbp->aio_nbytes = len; ++ if (wr) err = aio_write(fd_sys, aiocbp); ++ else err = aio_read(fd_sys, aiocbp); ++ ++ if (err == -1) { ++ if (errno == EAGAIN) { ++ /* exceeded the max. no. of outstanding requests. ++ complete all previous async. requests and try again. 
*/ ++ ++ ADIOI_Complete_async(&error_code); ++ if (wr) err = aio_write(fd_sys, aiocbp); ++ else err = aio_read(fd_sys, aiocbp); ++ ++ while (err == -1) { ++ if (errno == EAGAIN) { ++ /* sleep and try again */ ++ sleep(1); ++ if (wr) err = aio_write(fd_sys, aiocbp); ++ else err = aio_read(fd_sys, aiocbp); ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ *((struct aiocb **) handle) = aiocbp; ++ ++#elif (!defined(NO_AIO) && !defined(AIO_SUN)) ++/* DEC, SGI IRIX 5 and 6 */ ++ ++ aiocbp = (struct aiocb *) ADIOI_Calloc(sizeof(struct aiocb), 1); ++ aiocbp->aio_fildes = fd_sys; ++ aiocbp->aio_offset = offset; ++ aiocbp->aio_buf = buf; ++ aiocbp->aio_nbytes = len; ++ ++#ifdef AIO_PRIORITY_DEFAULT ++/* DEC */ ++ aiocbp->aio_reqprio = AIO_PRIO_DFL; /* not needed in DEC Unix 4.0 */ ++ aiocbp->aio_sigevent.sigev_signo = 0; ++#else ++ aiocbp->aio_reqprio = 0; ++#endif ++ ++#ifdef AIO_SIGNOTIFY_NONE ++/* SGI IRIX 6 */ ++ aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE; ++#else ++ aiocbp->aio_sigevent.sigev_signo = 0; ++#endif ++ ++ if (wr) err = aio_write(aiocbp); ++ else err = aio_read(aiocbp); ++ ++ if (err == -1) { ++ if (errno == EAGAIN) { ++ /* exceeded the max. no. of outstanding requests. ++ complete all previous async. requests and try again. 
*/ ++ ++ ADIOI_Complete_async(&error_code); ++ if (wr) err = aio_write(aiocbp); ++ else err = aio_read(aiocbp); ++ ++ while (err == -1) { ++ if (errno == EAGAIN) { ++ /* sleep and try again */ ++ sleep(1); ++ if (wr) err = aio_write(aiocbp); ++ else err = aio_read(aiocbp); ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ } ++ else { ++ FPRINTF(stderr, "Unknown errno %d in ADIOI_LUSTRE_aio\n", errno); ++ MPI_Abort(MPI_COMM_WORLD, 1); ++ } ++ } ++ ++ *((struct aiocb **) handle) = aiocbp; ++#endif ++ ++ return err; ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_open.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_open.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_open.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_open.c 2005-12-06 11:54:37.906127861 -0500 +@@ -0,0 +1,100 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_open.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_Open(ADIO_File fd, int *error_code) ++{ ++ int perm, old_mask, amode; ++ struct lov_user_md lum = { 0 }; ++ char *value; ++ ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_OPEN"; ++#endif ++ ++ if (fd->perm == ADIO_PERM_NULL) { ++ old_mask = umask(022); ++ umask(old_mask); ++ perm = old_mask ^ 0666; ++ } ++ else perm = fd->perm; ++ ++ amode = 0; ++ if (fd->access_mode & ADIO_CREATE) ++ amode = amode | O_CREAT; ++ if (fd->access_mode & ADIO_RDONLY) ++ amode = amode | O_RDONLY; ++ if (fd->access_mode & ADIO_WRONLY) ++ amode = amode | O_WRONLY; ++ if (fd->access_mode & ADIO_RDWR) ++ amode = amode | O_RDWR; ++ if (fd->access_mode & ADIO_EXCL) ++ amode = amode | O_EXCL; ++ ++ fd->fd_sys = open(fd->filename, amode, perm); ++ ++ if (fd->fd_sys != -1) { ++ int err; ++ ++ value = (char *) ADIOI_Malloc((MPI_MAX_INFO_VAL+1)*sizeof(char)); ++ ++ /* get file striping information and set it in info */ ++ lum.lmm_magic = LOV_USER_MAGIC; ++ err = ioctl(fd->fd_sys, LL_IOC_LOV_GETSTRIPE, (void *) &lum); ++ ++ if (!err) { ++ sprintf(value, "%d", lum.lmm_stripe_size); ++ MPI_Info_set(fd->info, "striping_unit", value); ++ ++ sprintf(value, "%d", lum.lmm_stripe_count); ++ MPI_Info_set(fd->info, "striping_factor", value); ++ ++ sprintf(value, "%d", lum.lmm_stripe_offset); ++ MPI_Info_set(fd->info, "start_iodevice", value); ++ } ++ ADIOI_Free(value); ++ ++ if (fd->access_mode & ADIO_APPEND) ++ fd->fp_ind = fd->fp_sys_posn = lseek(fd->fd_sys, 0, SEEK_END); ++ } ++ ++ ++ if ((fd->fd_sys != -1) && (fd->access_mode & ADIO_APPEND)) ++ fd->fp_ind = fd->fp_sys_posn = lseek(fd->fd_sys, 0, SEEK_END); ++ ++ if (fd->fd_sys == -1) { ++#ifdef MPICH2 ++ if (errno == ENAMETOOLONG) ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_BAD_FILE, "**filenamelong", "**filenamelong %s %d", fd->filename, strlen(fd->filename) ); ++ else if (errno == ENOENT) ++ 
*error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_NO_SUCH_FILE, "**filenoexist", "**filenoexist %s", fd->filename ); ++ else if (errno == ENOTDIR || errno == ELOOP) ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_BAD_FILE, "**filenamedir", "**filenamedir %s", fd->filename ); ++ else if (errno == EACCES) { ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_ACCESS, "**fileaccess", "**fileaccess %s", ++ fd->filename ); ++ } ++ else if (errno == EROFS) { ++ /* Read only file or file system and write access requested */ ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_READ_ONLY, "**ioneedrd", 0 ); ++ } ++ else { ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ } ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(ADIO_FILE_NULL, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_rdcoll.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_rdcoll.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_rdcoll.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_rdcoll.c 2005-12-06 11:54:37.907127727 -0500 +@@ -0,0 +1,18 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_rdcoll.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_ReadStridedColl(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code) ++{ ++ ADIOI_GEN_ReadStridedColl(fd, buf, count, datatype, file_ptr_type, ++ offset, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_read.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_read.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_read.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_read.c 2005-12-06 11:54:37.907127727 -0500 +@@ -0,0 +1,67 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_read.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_ReadContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int *error_code) ++{ ++ int err=-1, datatype_size, len; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_READCONTIG"; ++#endif ++ ++ MPI_Type_size(datatype, &datatype_size); ++ len = datatype_size * count; ++ ++ if (file_ptr_type == ADIO_EXPLICIT_OFFSET) { ++ if (fd->fp_sys_posn != offset) ++ lseek(fd->fd_sys, offset, SEEK_SET); ++ err = read(fd->fd_sys, buf, len); ++ fd->fp_sys_posn = offset + len; ++ /* individual file pointer not updated */ ++ } ++ else { /* read from curr. location of ind. 
file pointer */ ++ if (fd->fp_sys_posn != fd->fp_ind) ++ lseek(fd->fd_sys, fd->fp_ind, SEEK_SET); ++ err = read(fd->fd_sys, buf, len); ++ fd->fp_ind += err; ++ fd->fp_sys_posn = fd->fp_ind; ++ } ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if (err != -1) MPIR_Status_set_bytes(status, datatype, err); ++#endif ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++} ++ ++ ++ ++ ++void ADIOI_LUSTRE_ReadStrided(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code) ++{ ++ ADIOI_GEN_ReadStrided(fd, buf, count, datatype, file_ptr_type, ++ offset, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_resize.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_resize.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_resize.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_resize.c 2005-12-06 11:54:37.909127460 -0500 +@@ -0,0 +1,32 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_resize.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_Resize(ADIO_File fd, ADIO_Offset size, int *error_code) ++{ ++ int err; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_RESIZE"; ++#endif ++ ++ err = ftruncate(fd->fd_sys, size); ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_seek.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_seek.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_seek.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_seek.c 2005-12-06 11:54:37.911127194 -0500 +@@ -0,0 +1,15 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_seek.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++ADIO_Offset ADIOI_LUSTRE_SeekIndividual(ADIO_File fd, ADIO_Offset offset, ++ int whence, int *error_code) ++{ ++ return ADIOI_GEN_SeekIndividual(fd, offset, whence, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wait.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wait.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wait.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wait.c 2005-12-06 11:54:37.914126794 -0500 +@@ -0,0 +1,188 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_wait.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_ReadComplete(ADIO_Request *request, ADIO_Status *status, int *error_code) ++{ ++#ifndef NO_AIO ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_READCOMPLETE"; ++#endif ++#ifdef AIO_SUN ++ aio_result_t *result=0, *tmp; ++#else ++ int err; ++#endif ++#ifdef AIO_HANDLE_IN_AIOCB ++ struct aiocb *tmp1; ++#endif ++#endif ++ ++ if (*request == ADIO_REQUEST_NULL) { ++ *error_code = MPI_SUCCESS; ++ return; ++ } ++ ++#ifdef AIO_SUN ++ if ((*request)->queued) { /* dequeue it */ ++ tmp = (aio_result_t *) (*request)->handle; ++ while (tmp->aio_return == AIO_INPROGRESS) usleep(1000); ++ /* sleep for 1 ms., until done. Is 1 ms. a good number? 
*/ ++ /* when done, dequeue any one request */ ++ result = (aio_result_t *) aiowait(0); ++ ++ (*request)->nbytes = tmp->aio_return; ++ ++ if (tmp->aio_return == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(tmp->aio_errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(tmp->aio_errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ ++/* aiowait only dequeues a request. The completion of a request can be ++ checked by just checking the aio_return flag in the handle passed ++ to the original aioread()/aiowrite(). Therefore, I need to ensure ++ that aiowait() is called exactly once for each previous ++ aioread()/aiowrite(). This is also taken care of in ADIOI_xxxDone */ ++ } ++ else *error_code = MPI_SUCCESS; ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if ((*request)->nbytes != -1) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ ++#endif ++ ++#ifdef AIO_HANDLE_IN_AIOCB ++/* IBM */ ++ if ((*request)->queued) { ++ do { ++ err = aio_suspend(1, (struct aiocb **) &((*request)->handle)); ++ } while ((err == -1) && (errno == EINTR)); ++ ++ tmp1 = (struct aiocb *) (*request)->handle; ++ if (err != -1) { ++ err = aio_return(tmp1->aio_handle); ++ (*request)->nbytes = err; ++ errno = aio_error(tmp1->aio_handle); ++ } ++ else (*request)->nbytes = -1; ++ ++/* on DEC, it is required to call aio_return to dequeue the request. ++ IBM man pages don't indicate what function to use for dequeue. ++ I'm assuming it is aio_return! POSIX says aio_return may be called ++ only once on a given handle. 
*/ ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ } /* if ((*request)->queued) */ ++ else *error_code = MPI_SUCCESS; ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if ((*request)->nbytes != -1) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ ++#elif (!defined(NO_AIO) && !defined(AIO_SUN)) ++/* DEC, SGI IRIX 5 and 6 */ ++ if ((*request)->queued) { ++ do { ++ err = aio_suspend((const aiocb_t **) &((*request)->handle), 1, 0); ++ } while ((err == -1) && (errno == EINTR)); ++ ++ if (err != -1) { ++ err = aio_return((struct aiocb *) (*request)->handle); ++ (*request)->nbytes = err; ++ errno = aio_error((struct aiocb *) (*request)->handle); ++ } ++ else (*request)->nbytes = -1; ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++ return; ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else /* MPICH-1 */ ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error((*request)->fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++ } /* if ((*request)->queued) */ ++ else *error_code = MPI_SUCCESS; ++#ifdef HAVE_STATUS_SET_BYTES ++ if ((*request)->nbytes != -1) ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++#endif ++ ++#ifndef NO_AIO ++ if ((*request)->queued != -1) { ++ ++ /* queued = -1 is an internal hack used when the request 
must ++ be completed, but the request object should not be ++ freed. This is used in ADIOI_Complete_async, because the user ++ will call MPI_Wait later, which would require status to ++ be filled. Ugly but works. queued = -1 should be used only ++ in ADIOI_Complete_async. ++ This should not affect the user in any way. */ ++ ++ /* if request is still queued in the system, it is also there ++ on ADIOI_Async_list. Delete it from there. */ ++ if ((*request)->queued) ADIOI_Del_req_from_list(request); ++ ++ (*request)->fd->async_count--; ++ if ((*request)->handle) ADIOI_Free((*request)->handle); ++ ADIOI_Free_request((ADIOI_Req_node *) (*request)); ++ *request = ADIO_REQUEST_NULL; ++ } ++ ++#else ++/* HP, FreeBSD, Linux */ ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ MPIR_Status_set_bytes(status, (*request)->datatype, (*request)->nbytes); ++#endif ++ (*request)->fd->async_count--; ++ ADIOI_Free_request((ADIOI_Req_node *) (*request)); ++ *request = ADIO_REQUEST_NULL; ++ *error_code = MPI_SUCCESS; ++#endif ++} ++ ++ ++void ADIOI_LUSTRE_WriteComplete(ADIO_Request *request, ADIO_Status *status, int *error_code) ++{ ++ ADIOI_LUSTRE_ReadComplete(request, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wrcoll.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wrcoll.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wrcoll.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_wrcoll.c 2005-12-06 11:54:37.914126794 -0500 +@@ -0,0 +1,18 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_wrcoll.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. 
++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_WriteStridedColl(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code) ++{ ++ ADIOI_GEN_WriteStridedColl(fd, buf, count, datatype, file_ptr_type, ++ offset, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_write.c mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_write.c +--- mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_write.c 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/ad_lustre_write.c 2005-12-06 11:54:37.914126794 -0500 +@@ -0,0 +1,66 @@ ++/* -*- Mode: C; c-basic-offset:4 ; -*- */ ++/* ++ * $Id: ad_lustre_write.c,v 1.1.1.1 2004/11/04 11:03:38 liam Exp $ ++ * ++ * Copyright (C) 1997 University of Chicago. ++ * See COPYRIGHT notice in top-level directory. ++ */ ++ ++#include "ad_lustre.h" ++ ++void ADIOI_LUSTRE_WriteContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int *error_code) ++{ ++ int err=-1, datatype_size, len; ++#if defined(MPICH2) || !defined(PRINT_ERR_MSG) ++ static char myname[] = "ADIOI_LUSTRE_WRITECONTIG"; ++#endif ++ ++ MPI_Type_size(datatype, &datatype_size); ++ len = datatype_size * count; ++ ++ if (file_ptr_type == ADIO_EXPLICIT_OFFSET) { ++ if (fd->fp_sys_posn != offset) ++ lseek(fd->fd_sys, offset, SEEK_SET); ++ err = write(fd->fd_sys, buf, len); ++ fd->fp_sys_posn = offset + err; ++ /* individual file pointer not updated */ ++ } ++ else { /* write from curr. location of ind. 
file pointer */ ++ if (fd->fp_sys_posn != fd->fp_ind) ++ lseek(fd->fd_sys, fd->fp_ind, SEEK_SET); ++ err = write(fd->fd_sys, buf, len); ++ fd->fp_ind += err; ++ fd->fp_sys_posn = fd->fp_ind; ++ } ++ ++#ifdef HAVE_STATUS_SET_BYTES ++ if (err != -1 && status) MPIR_Status_set_bytes(status, datatype, err); ++#endif ++ ++ if (err == -1) { ++#ifdef MPICH2 ++ *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**io", ++ "**io %s", strerror(errno)); ++#elif defined(PRINT_ERR_MSG) ++ *error_code = MPI_ERR_UNKNOWN; ++#else ++ *error_code = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ADIO_ERROR, ++ myname, "I/O Error", "%s", strerror(errno)); ++ ADIOI_Error(fd, *error_code, myname); ++#endif ++ } ++ else *error_code = MPI_SUCCESS; ++} ++ ++ ++ ++void ADIOI_LUSTRE_WriteStrided(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code) ++{ ++ ADIOI_GEN_WriteStrided(fd, buf, count, datatype, file_ptr_type, ++ offset, status, error_code); ++} +diff -r -u --new-file mpich-1.2.6/romio/adio/ad_lustre/Makefile.in mpich-1.2.6/romio/adio/ad_lustre/Makefile.in +--- mpich-1.2.6/romio/adio/ad_lustre/Makefile.in 1969-12-31 19:00:00.000000000 -0500 ++++ mpich-1.2.6/romio/adio/ad_lustre/Makefile.in 2005-12-06 11:54:37.883130927 -0500 +@@ -0,0 +1,47 @@ ++CC = @CC@ ++AR = @AR@ ++LIBNAME = @LIBNAME@ ++srcdir = @srcdir@ ++CC_SHL = @CC_SHL@ ++SHLIBNAME = @SHLIBNAME@ ++ ++INCLUDE_DIR = -I@MPI_INCLUDE_DIR@ -I${srcdir}/../include -I../include ++CFLAGS = @CFLAGS@ $(INCLUDE_DIR) ++ ++C_COMPILE_SHL = $(CC_SHL) @CFLAGS@ $(INCLUDE_DIR) ++ ++@VPATH@ ++ ++AD_LUSTRE_OBJECTS = ad_lustre_close.o ad_lustre_read.o \ ++ ad_lustre_open.o ad_lustre_write.o ad_lustre_done.o \ ++ ad_lustre_fcntl.o ad_lustre_iread.o ad_lustre_iwrite.o ad_lustre_wait.o \ ++ ad_lustre_resize.o ad_lustre_hints.o \ ++ ad_lustre.o ++ ++ ++default: $(LIBNAME) ++ @if [ "@ENABLE_SHLIB@" != "none" ] ; then \ ++ $(MAKE) 
$(SHLIBNAME).la ;\ ++ fi ++ ++.SUFFIXES: $(SUFFIXES) .p .lo ++ ++.c.o: ++ $(CC) $(CFLAGS) -c $< ++.c.lo: ++ $(C_COMPILE_SHL) -c $< ++ @mv -f $*.o $*.lo ++ ++$(LIBNAME): $(AD_LUSTRE_OBJECTS) ++ $(AR) $(LIBNAME) $(AD_LUSTRE_OBJECTS) ++ ++AD_LUSTRE_LOOBJECTS=$(AD_LUSTRE_OBJECTS:.o=.lo) ++$(SHLIBNAME).la: $(AD_LUSTRE_LOOBJECTS) ++ $(AR) $(SHLIBNAME).la $(AD_LUSTRE_LOOBJECTS) ++ ++coverage: ++ -@for file in ${AD_LUSTRE_OBJECTS:.o=.c} ; do \ ++ gcov -b -f $$file ; done ++ ++clean: ++ @rm -f *.o *.lo +--- mpich-1.2.6/romio/Makefile.in 2004-01-27 18:27:35.000000000 -0500 ++++ mpich-1.2.6/romio/Makefile.in 2005-12-06 11:54:38.000000000 -0500 +@@ -14,7 +14,7 @@ DIRS = mpi-io adio/common + MPIO_DIRS = mpi-io + EXTRA_SRC_DIRS = @EXTRA_SRC_DIRS@ + FILE_SYS_DIRS = @FILE_SYS_DIRS@ +-ALL_DIRS = mpi-io mpi-io/fortran mpi2-other/info mpi2-other/info/fortran mpi2-other/array mpi2-other/array/fortran adio/common adio/ad_pfs adio/ad_piofs adio/ad_nfs adio/ad_ufs adio/ad_xfs adio/ad_hfs adio/ad_sfs adio/ad_testfs adio/ad_pvfs adio/ad_pvfs2 test ++ALL_DIRS = mpi-io mpi-io/fortran mpi2-other/info mpi2-other/info/fortran mpi2-other/array mpi2-other/array/fortran adio/common adio/ad_pfs adio/ad_piofs adio/ad_nfs adio/ad_ufs adio/ad_xfs adio/ad_hfs adio/ad_sfs adio/ad_testfs adio/ad_pvfs adio/ad_pvfs2 adio/ad_lustre test + SHELL = /bin/sh + + @VPATH@ +--- mpich-1.2.6/romio/configure.in 2004-08-02 09:37:31.000000000 -0400 ++++ mpich-1.2.6/romio/configure.in 2005-12-06 11:54:38.000000000 -0500 +@@ -90,7 +90,7 @@ MPIO_REQ_REAL_POBJECTS="_iotest.o _iowai + # + have_aio=no + # +-known_filesystems="nfs ufs pfs piofs pvfs pvfs2 testfs xfs hfs sfs" ++known_filesystems="nfs ufs pfs piofs pvfs pvfs2 testfs xfs hfs sfs lustre" + known_mpi_impls="mpich_mpi sgi_mpi hp_mpi cray_mpi lam_mpi" + # + # Defaults +@@ -1270,6 +1270,9 @@ fi + if test -n "$file_system_testfs"; then + AC_DEFINE(ROMIO_TESTFS,1,[Define for TESTFS]) + fi ++if test -n "$file_system_lustre"; then ++ AC_DEFINE(ROMIO_LUSTRE,1,[Define for 
LUSTRE]) ++fi + if test -n "$file_system_piofs"; then + AC_DEFINE(PIOFS,1,[Define for PIOFS]) + USER_CFLAGS="$USER_CFLAGS -bI:/usr/include/piofs/piofs.exp" +@@ -1634,7 +1637,7 @@ AC_OUTPUT(Makefile localdefs mpi-io/Make + adio/ad_nfs/Makefile adio/ad_ufs/Makefile \ + adio/ad_xfs/Makefile adio/ad_hfs/Makefile \ + adio/ad_sfs/Makefile adio/ad_pfs/Makefile \ +- adio/ad_testfs/Makefile adio/ad_pvfs/Makefile \ ++ adio/ad_testfs/Makefile adio/ad_lustre/Makefile adio/ad_pvfs/Makefile \ + adio/ad_pvfs2/Makefile adio/ad_piofs/Makefile \ + mpi-io/fortran/Makefile mpi2-other/info/fortran/Makefile \ + mpi2-other/array/fortran/Makefile test/fmisc.f \ +--- mpich-1.2.6/romio/configure 2004-08-04 12:08:28.000000000 -0400 ++++ mpich-1.2.6/romio/configure 2005-12-06 11:54:38.000000000 -0500 +@@ -623,7 +623,7 @@ MPIO_REQ_REAL_POBJECTS="_iotest.o _iowai + # + have_aio=no + # +-known_filesystems="nfs ufs pfs piofs pvfs pvfs2 testfs xfs hfs sfs" ++known_filesystems="nfs ufs pfs piofs pvfs pvfs2 testfs lustre xfs hfs sfs" + known_mpi_impls="mpich_mpi sgi_mpi hp_mpi cray_mpi lam_mpi" + # + # Defaults +@@ -4022,6 +4022,13 @@ if test -n "$file_system_testfs"; then + EOF + + fi ++if test -n "$file_system_lustre"; then ++ cat >> confdefs.h <<\EOF ++#define LUSTRE 1 ++EOF ++ ++fi ++ + if test -n "$file_system_piofs"; then + cat >> confdefs.h <<\EOF + #define PIOFS 1 +@@ -4746,7 +4753,7 @@ trap 'rm -fr `echo "Makefile localdefs m + adio/ad_xfs/Makefile adio/ad_hfs/Makefile \ + adio/ad_sfs/Makefile adio/ad_pfs/Makefile \ + adio/ad_testfs/Makefile adio/ad_pvfs/Makefile \ +- adio/ad_pvfs2/Makefile adio/ad_piofs/Makefile \ ++ adio/ad_pvfs2/Makefile adio/ad_piofs/Makefile adio/ad_lustre/Makefile\ + mpi-io/fortran/Makefile mpi2-other/info/fortran/Makefile \ + mpi2-other/array/fortran/Makefile test/fmisc.f \ + test/fcoll_test.f test/pfcoll_test.f test/fperf.f adio/include/romioconf.h" | sed "s/:[^ ]*//g"` conftest*; exit 1' 1 2 15 +@@ -4912,7 +4919,7 @@ CONFIG_FILES=\${CONFIG_FILES-"Makefile l + 
adio/ad_nfs/Makefile adio/ad_ufs/Makefile \ + adio/ad_xfs/Makefile adio/ad_hfs/Makefile \ + adio/ad_sfs/Makefile adio/ad_pfs/Makefile \ +- adio/ad_testfs/Makefile adio/ad_pvfs/Makefile \ ++ adio/ad_testfs/Makefile adio/ad_lustre/Makefile adio/ad_pvfs/Makefile \ + adio/ad_pvfs2/Makefile adio/ad_piofs/Makefile \ + mpi-io/fortran/Makefile mpi2-other/info/fortran/Makefile \ + mpi2-other/array/fortran/Makefile test/fmisc.f \ +--- mpich-1.2.6/romio/adio/include/romioconf.h.in 2004-08-04 12:08:28.000000000 -0400 ++++ mpich-1.2.6/romio/adio/include/romioconf.h.in 2005-12-06 11:54:38.000000000 -0500 +@@ -192,6 +192,9 @@ + /* Define for TESTFS */ + #undef ROMIO_TESTFS + ++/* Define for LUSTRE */ ++#undef LUSTRE ++ + /* Define for PIOFS */ + #undef PIOFS + +--- mpich-1.2.6/romio/adio/include/mpio_error.h 2002-11-15 11:26:23.000000000 -0500 ++++ mpich-1.2.6/romio/adio/include/mpio_error.h 2005-12-06 11:54:38.000000000 -0500 +@@ -62,6 +62,7 @@ + #define MPIR_ERR_FILETYPE 33 + #define MPIR_ERR_NO_NTFS 35 + #define MPIR_ERR_NO_TESTFS 36 ++#define MPIR_ERR_NO_LUSTRE 37 + + /* MPI_ERR_COMM */ + #ifndef MPIR_ERR_COMM_NULL +--- mpich-1.2.6/romio/adio/include/adioi_fs_proto.h 2003-06-24 18:48:23.000000000 -0400 ++++ mpich-1.2.6/romio/adio/include/adioi_fs_proto.h 2005-12-06 11:54:38.000000000 -0500 +@@ -261,6 +261,68 @@ ADIO_Offset ADIOI_UFS_SeekIndividual(ADI + void ADIOI_UFS_SetInfo(ADIO_File fd, MPI_Info users_info, int *error_code); + #endif + ++#ifdef LUSTRE ++extern struct ADIOI_Fns_struct ADIO_LUSTRE_operations; ++ ++void ADIOI_LUSTRE_Open(ADIO_File fd, int *error_code); ++void ADIOI_LUSTRE_Close(ADIO_File fd, int *error_code); ++void ADIOI_LUSTRE_ReadContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code); ++void ADIOI_LUSTRE_WriteContig(ADIO_File fd, void *buf, int count, ++ MPI_Datatype datatype, int file_ptr_type, ++ ADIO_Offset offset, ADIO_Status *status, int ++ *error_code); 
++void ADIOI_LUSTRE_IwriteContig(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Request *request, int
++                    *error_code);
++void ADIOI_LUSTRE_IreadContig(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Request *request, int
++                    *error_code);
++int ADIOI_LUSTRE_ReadDone(ADIO_Request *request, ADIO_Status *status, int
++                    *error_code);
++int ADIOI_LUSTRE_WriteDone(ADIO_Request *request, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_ReadComplete(ADIO_Request *request, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_WriteComplete(ADIO_Request *request, ADIO_Status *status,
++                    int *error_code);
++void ADIOI_LUSTRE_Fcntl(ADIO_File fd, int flag, ADIO_Fcntl_t *fcntl_struct, int
++                    *error_code);
++void ADIOI_LUSTRE_WriteStrided(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_ReadStrided(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_WriteStridedColl(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_ReadStridedColl(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Status *status, int
++                    *error_code);
++void ADIOI_LUSTRE_IreadStrided(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Request *request, int
++                    *error_code);
++void ADIOI_LUSTRE_IwriteStrided(ADIO_File fd, void *buf, int count,
++                    MPI_Datatype datatype, int file_ptr_type,
++                    ADIO_Offset offset, ADIO_Request *request, int
++                    *error_code);
++void ADIOI_LUSTRE_Flush(ADIO_File fd, int *error_code);
++void ADIOI_LUSTRE_Resize(ADIO_File fd, ADIO_Offset size, int *error_code);
++ADIO_Offset ADIOI_LUSTRE_SeekIndividual(ADIO_File fd, ADIO_Offset offset,
++                    int whence, int *error_code);
++void ADIOI_LUSTRE_SetInfo(ADIO_File fd, MPI_Info users_info, int *error_code);
++#endif
++
+ #ifdef ROMIO_NTFS
+ extern struct ADIOI_Fns_struct ADIO_NTFS_operations;
+ 
+--- mpich-1.2.6/romio/adio/include/adio.h	2004-06-07 13:59:57.000000000 -0400
++++ mpich-1.2.6/romio/adio/include/adio.h	2005-12-06 11:54:38.000000000 -0500
+@@ -276,6 +276,7 @@ typedef struct {
+ #define ADIO_NTFS 158 /* NTFS for Windows NT */
+ #define ADIO_TESTFS 159 /* fake file system for testing */
+ #define ADIO_PVFS2 160 /* PVFS2: 2nd generation PVFS */
++#define ADIO_LUSTRE 161 /* Lustre */
+ 
+ #define ADIO_SEEK_SET SEEK_SET
+ #define ADIO_SEEK_CUR SEEK_CUR
+--- mpich-1.2.6/romio/adio/common/setfn.c	2003-06-24 18:48:18.000000000 -0400
++++ mpich-1.2.6/romio/adio/common/setfn.c	2005-12-06 11:54:38.000000000 -0500
+@@ -114,6 +114,16 @@ void ADIOI_SetFunctions(ADIO_File fd)
+ #endif
+     break;
+ 
++    case ADIO_LUSTRE:
++#ifdef LUSTRE
++    *(fd->fns) = ADIO_LUSTRE_operations;
++#else
++    FPRINTF(stderr, "ADIOI_SetFunctions: ROMIO has not been configured to use the LUSTRE file system\n");
++    MPI_Abort(MPI_COMM_WORLD, 1);
++#endif
++    break;
++
++
+     default:
+     FPRINTF(stderr, "ADIOI_SetFunctions: Unsupported file system type\n");
+     MPI_Abort(MPI_COMM_WORLD, 1);
+--- mpich-1.2.6/romio/adio/common/ad_fstype.c	2003-09-04 16:24:44.000000000 -0400
++++ mpich-1.2.6/romio/adio/common/ad_fstype.c	2005-12-06 11:54:38.000000000 -0500
+@@ -204,6 +204,11 @@ static void ADIO_FileSysType_fncall(char
+     }
+     }
+ #elif defined(LINUX)
++#warning use correct include
++# if defined (LUSTRE)
++#define LL_SUPER_MAGIC 0x0BD00BD0
++# endif
++
+     do {
+         err = statfs(filename, &fsbuf);
+     } while (err && (errno == ESTALE));
+@@ -218,6 +223,9 @@ static void ADIO_FileSysType_fncall(char
+     else {
+         /* FPRINTF(stderr, "%d\n", fsbuf.f_type);*/
+         if (fsbuf.f_type == NFS_SUPER_MAGIC) *fstype = ADIO_NFS;
++# if defined (LUSTRE)
++        else if (fsbuf.f_type == LL_SUPER_MAGIC) *fstype = ADIO_LUSTRE;
++#endif
+ # if defined(ROMIO_PVFS)
+         else if (fsbuf.f_type == PVFS_SUPER_MAGIC) *fstype = ADIO_PVFS;
+ # endif
+@@ -359,6 +367,11 @@ static void ADIO_FileSysType_prefix(char
+     {
+         *fstype = ADIO_TESTFS;
+     }
++    else if (!strncmp(filename, "lustre:", 7)
++             || !strncmp(filename, "LUSTRE:", 7))
++    {
++        *fstype = ADIO_LUSTRE;
++    }
+     else {
+ #ifdef ROMIO_NTFS
+         *fstype = ADIO_NTFS;
+@@ -644,6 +657,24 @@ void ADIO_ResolveFileType(MPI_Comm comm,
+         *ops = &ADIO_TESTFS_operations;
+ #endif
+     }
++    if (file_system == ADIO_LUSTRE) {
++#ifndef LUSTRE
++# ifdef MPICH2
++        *error_code = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_IO, "**iofstypeunsupported", 0);
++        return;
++# elif defined(PRINT_ERR_MSG)
++        FPRINTF(stderr, "ADIO_ResolveFileType: ROMIO has not been configured to use the LUSTRE file system\n");
++        MPI_Abort(MPI_COMM_WORLD, 1);
++# else /* MPICH-1 */
++        myerrcode = MPIR_Err_setmsg(MPI_ERR_IO, MPIR_ERR_NO_LUSTRE,
++                                    myname, (char *) 0, (char *) 0);
++        *error_code = ADIOI_Error(MPI_FILE_NULL, myerrcode, myname);
++# endif
++        return;
++#else
++        *ops = &ADIO_LUSTRE_operations;
++#endif
++    }
+     *error_code = MPI_SUCCESS;
+     *fstype = file_system;
+     return;
diff --git a/lustre/doc/Makefile.am b/lustre/doc/Makefile.am
index 2b214f7..1d02c60 100644
--- a/lustre/doc/Makefile.am
+++ b/lustre/doc/Makefile.am
@@ -15,7 +15,7 @@ TEXEXPAND = texexpand
 SUFFIXES = .lin .lyx .pdf .ps .sgml .html .txt .tex .fig .eps .dvi
 
 if UTILS
-man_MANS = lfs.1 lmc.1 lconf.8 lctl.8
+man_MANS = lustre.7 lfs.1 mount.lustre.8 mkfs.lustre.8 tunefs.lustre.8 lctl.8
 endif
 
 LYXFILES= $(filter-out $(patsubst %.lin,%.lyx,$(wildcard *.lin)),\
@@ -23,8 +23,8 @@ LYXFILES= $(filter-out $(patsubst %.lin,%.lyx,$(wildcard *.lin)),\
 
 CLEANFILES = *.aux *.tex *.log *.pdf
 
-EXTRA_DIST = tex2pdf $(man_MANS) \
-	$(LYXFILES) lfs.1 lmc.1 lconf.8 lctl.8
+EXTRA_DIST = tex2pdf lustre.7 mount.lustre.8 mkfs.lustre.8 tunefs.lustre.8 \
+	$(LYXFILES) lfs.1 lctl.8
 
 all:
diff --git a/lustre/doc/lconf.8 b/lustre/doc/lconf.8
index 6143c6b..a6ca88a 100644
--- a/lustre/doc/lconf.8
+++ b/lustre/doc/lconf.8
@@ -4,48 +4,59 @@ lconf \- Lustre filesystem configuration utility
 .SH SYNOPSIS
 .br
 .B lconf
-[--node ] [-d,--cleanup] [--noexec] [--gdb] [--nosetup] [--nomod] [-n,--noexec] [-v,--verbose] [-h,--help]
-[options] --add [args]
+[OPTIONS]
 .br
 .SH DESCRIPTION
 .B lconf
-, when invoked configures a node following directives in the . There will be single configuration file for all the nodes in a single cluster. This file should be distributed to all the nodes in the cluster or kept in a location accessible to all the nodes. One option is to store the cluster configuration information in LDAP format on an LDAP server that can be reached from all the cluster nodes.
+, when invoked configures a node following directives in the
+.Can be used to control recovery and startup/shutdown
+. There will be a single configuration file for all the nodes in a
+single cluster. This file should be distributed to all the nodes in
+the cluster or kept in a location accessible to all the nodes. The XML file must be specified. When invoked with no options, lconf will attempt to configure the resources owned by the node it is invoked on
 .PP
 The arguments that can be used for lconf are:
 .PP
 .TP
+--abort_recovery - Used to start Lustre when you are certain that
+recovery will not succeed, as when an OST or MDS is disabled.
+.TP
+--acl Enable Access Control List support on the MDS
+.TP
+--allow_unprivileged_port Allows connections from unprivileged ports
+.TP
+--clientoptions
+Additional options for mounting Lustre clients. Obsolete with
+zeroconfig mounting.
+.TP
 --client_uuid
 The failed client (required for recovery).
 .TP
---clientoptions
-Additional options for Lustre.
+--clumanager Generate a Red Hat Clumanager configuration file for this
+node.
 .TP
 --config
-Cluster configuration name used for LDAP query
+Cluster configuration name used for LDAP query (deprecated)
 .TP
 --conn_uuid
 The failed connection (required for recovery).
 .TP
---d|--cleanup
+-d|--cleanup
 Unconfigure a node. The same config and --node argument used for configuration needs to be used for cleanup as well. This will attempt to undo all of the configuration steps done by lconf, including unloading the kernel modules.
 .TP
 --debug_path
-Path to save debug dumps.
+Path to save debug dumps. (default is /tmp/lustre-log)
 .TP
 --dump
 Dump the kernel debug log to the specified file before portals is unloaded during cleanup.
 .TP
---dump_path
-Path to save debug dumps. Default is /tmp/lustre_log
-.TP
 --failover
-Used to shutdown without saving state. Default is 0. This will allow the node to give up service to another node for failover purposes. This will not be a clean shutdown.
+Used to shutdown without saving state. This will allow the node to give up service to another node for failover purposes. This will not be a clean shutdown.
 .TP
---force
-Forced unmounting and/or obd detach during cleanup. Default is 0.
+-f|--force
+Forced unmounting and/or obd detach during cleanup.
 .TP
 --gdb
-Causes lconf to print a message and pause for 5 seconds after creating a gdb module script and before doing any Lustre configuration (the gdb module script is always created, however).
+Causes lconf to create a gdb module script and pause 5 seconds before doing any Lustre configuration (the gdb module script is always created, however).
 .TP
 --gdb_script
 Full name of gdb debug script. Default is /tmp/ogdb.
@@ -66,19 +77,29 @@ The UUID of the service to be ignored by a client mounting Lustre. Allows the cl
 Dump all ioctls to the specified file
 .TP
 --ldapurl
-LDAP server URL
+LDAP server URL. Deprecated
+.TP
+--lustre=src_dir
+Specify the base directory for Lustre sources; this parameter will cause lconf to load the lustre modules from this source tree.
 .TP
 --lustre_upcall
 Set the location of the Lustre upcall scripts used by the client for recovery
 .TP
---lustre=src_dir
-Specify the base directory for Lustre sources, this parameter will cause lconf to load the lustre modules from this soure tree.
+--make_service_scripts Create per-service symlinks for use with clumanager HA software
 .TP
 --mds_ost_conn
 Open connections to OSTs on MDS.
 .TP
 --maxlevel
-Perform configuration of devices and services up to level given. level can take the values net, dev, svc, fs. When used in conjunction with cleanup, services are torn down up to a certain level. Default is 100.
+Perform configuration of devices and services up to level given. When
+used in conjunction with cleanup, services are torn down up to a
+certain level.
+Levels are approximately like:
+10 - network
+20 - device, ldlm
+30 - osd, mdd
+40 - mds, ost
+70 - mountpoint, echo_client, osc, mdc, lov
 .TP
 --minlevel
 Specify the minimum level of services to configure/cleanup. Default is 0.
@@ -101,24 +122,36 @@ Only setup devices and services, do not load modules.
 --nosetup
 Only load modules, do not configure devices or services.
 .TP
+--old_conf Start up service even though config logs appear outdated.
+.TP
 --portals
-Specify portals source directory. If this is a relative path, then it is assumed to be relative to lustre.
+Specify portals source directory. If this is a relative path, then it
+is assumed to be relative to lustre. (Deprecated)
 .TP
 --portals_upcall
-Specify the location of the Portals upcall scripts used by the client for recovery
+Specify the location of the Portals upcall scripts used by the client
+for recovery (Deprecated)
 .TP
 --ptldebug debug-level
 This options can be used to set the required debug level.
 .TP
+--quota
+Enable quota support for client filesystem
+.TP
+--rawprimary For clumanager, device of the primary quorum
+(default=/dev/raw/raw1)
+.TP
+--rawsecondary For clumanager, device of the secondary quorum (default=/dev/raw/raw2)
+.TP
 --record
 Write config information on mds.
 .TP
---record_log
-Specify the name of config record log.
-.TP
 --record_device
 Specify MDS device name that will record the config commands.
 .TP
+--record_log
+Specify the name of config record log.
+.TP
 --recover
 Recover a device.
 .TP
@@ -131,6 +164,11 @@ Select a particular node for a service
 --service
 Shorthand for --group --select =
 .TP
+--service_scripts For clumanager, directory containing per-service scripts (default=/etc/lustre/services)
+.TP
+--single_socket The socknal option. Uses only one socket instead of a
+bundle.
+.TP
 --subsystem
 Set the portals debug subsystem.
 .TP
@@ -141,7 +179,10 @@ Specify the failed target (required for recovery).
 Set the recovery timeout period.
 .TP
 --upcall
-Set the location of both Lustre and Portals upcall scripts used by the client for recovery
+Set the location of both Lustre and Portals upcall scripts used by the
+client for recovery
+.TP
+--user_xattr Enable user_xattr support on MDS
 .TP
 --verbose,-v
 Be verbose and show actions while going along.
diff --git a/lustre/doc/lctl.8 b/lustre/doc/lctl.8
index 69c6ece..2015734 100644
--- a/lustre/doc/lctl.8
+++ b/lustre/doc/lctl.8
@@ -34,269 +34,213 @@ To get a complete listing of available commands, type
 help at the lctl prompt. For non-interactive single-threaded use, one uses
 the second invocation, which runs command after connecting to the device.
 
-.B Network Configuration
+.SS Network Configuration
 .TP
---net
-Indicate the network type to be used for the operation.
-.TP
-network
-Indicate what kind of network applies for the configuration commands that follow.
+.BI network " |"
+Start or stop LNET, or select a network type for other
+.I
+lctl
+commands
+.TP
+.BI list_nids
+Print all Network Identifiers on the local node
+.TP
+.BI which_nid " "
+From a list of nids for a remote node, show which interface communication
+will take place on.
+.TP
+.BI interface_list
 Print the interface entries.
 .TP
-add_interface [netmask]
+.BI add_interface " [netmask]"
 Add an interface entry.
 .TP
-del_interface [ip]
+.BI del_interface " [ip]"
 Delete an interface entry.
 .TP
-peer_list
+.BI peer_list
 Print the peer entries.
 .TP
-add_peer
+.BI add_peer " "
 Add a peer entry.
 .TP
-del_peer [] [] [ks]
+.BI del_peer " [] [] [ks] "
 Remove a peer entry.
 .TP
-conn_list
-Print all the connected remote nid.
-.TP
-connect [[ ] | ]
-This will establish a connection to a remote network network id given by the hostname/port combination, or the elan id.
-.TP
-disconnect
-Disconnect from a remote nid.
+.BI conn_list
+Print all the connected remote nids on a network.
 .TP
-active_tx
+.BI active_tx
 This command should print active transmits, and it is only used for elan network type.
 .TP
-mynid [nid]
-Informs the socknal of the local nid. It defaults to hostname for tcp networks and is automatically setup for elan/myrinet networks.
-.TP
-shownid
-Print the local NID.
-.TP
-add_uuid
-Associate a given UUID with an nid.
-.TP
-close_uuid
-Disconnect a UUID.
-.TP
-del_uuid
-Delete a UUID association.
-.TP
-add_route [target]
+.BI add_route " [target] "
 Add an entry to the routing table for the given target.
 .TP
-del_route
+.BI del_route " "
 Delete an entry for the target from the routing table.
 .TP
-set_route [